This document contains all of the commands necessary for the analyses performed in: Food or just a free ride? A meta-analysis reveals the global diversity of the Plastisphere
Please contact Robyn Wright with any questions.
Opening the .html version will allow you to view all of the code used and the analyses produced, while opening the .Rmd document will allow you to run everything if you open it within RStudio.
This provides links to all studies included in this meta-analysis, listed by the names that I have used to refer to them throughout.
If you do not have any of these packages already installed, you will need to install them. If you have problems with R finding Python, it may be worth explicitly telling R where to find the Python version you want to use, as described here or here (what worked for me was changing it in my .Rprofile, found in the folder given by running R.home() in the RStudio console). I made this notebook using RStudio version 1.3.959 and Python version 3.8.3, with Python in a conda environment named r-reticulate.
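One way to pin reticulate to a specific Python is a one-line addition to your .Rprofile, or a call to use_condaenv() before reticulate first initialises Python. This is a minimal sketch; the path below is a placeholder for your own conda environment:

```r
# In .Rprofile (in the folder given by R.home()), point reticulate at a
# specific Python before it is first initialised - the path is a placeholder:
Sys.setenv(RETICULATE_PYTHON = "/path/to/miniconda3/envs/r-reticulate/bin/python")

# or, from within a session, before Python is first used:
reticulate::use_condaenv("r-reticulate", required = TRUE)
```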
load_all_R_packages <- function() {
library(reticulate)
library(kableExtra)
library(knitr)
library(exactRankTests)
library(nlme)
library(dplyr)
library(ggplot2)
library(vegan)
library(phyloseq)
library(ape)
library(microbiome)
library(ggnewscale)
#library(metacoder)
#library(ggtree)
library(compositions)
}
load_all_R_packages()
## Warning: package 'compositions' was built under R version 3.6.2
I have found conda-forge to be most successful for installing the majority of these packages.
from Bio import SeqIO
# note: Bio.Alphabet was removed in Biopython 1.78, so the two imports below
# require an older Biopython version
from Bio.Alphabet import IUPAC
from Bio.Alphabet import generic_dna, generic_protein
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
import csv
from datetime import datetime
import importlib
from itertools import chain
from lifelines.utils import concordance_index
import math
from matplotlib.lines import Line2D
import matplotlib as mpl
import matplotlib.lines as mlines
from matplotlib.offsetbox import AnchoredText
from matplotlib.patches import Patch
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes, TransformedBbox, BboxPatch, BboxConnector
import numpy as np
from optparse import OptionParser
import os
import pandas as pd
from pdf2image import convert_from_path, convert_from_bytes
import pickle
import random
from scipy.cluster import hierarchy
from scipy.stats import pearsonr
from scipy.spatial import distance
import scipy.spatial.distance as ssd
import scipy.stats as stats
from skbio import DistanceMatrix
from skbio.stats.distance import anosim
from skbio.stats.distance import permanova
from skbio.stats.composition import ancom
from sklearn.cluster import AgglomerativeClustering
from sklearn import manifold
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
I struggled a bit with scikit-bio, but this can be installed with:
conda install -c conda-forge scikit-bio
I have provided all of the data for this study in this Figshare file. These include the deblur outputs for each study as well as the merged QIIME2 representative sequences and feature table. If you want to either add a study to this or recreate the analyses used here, then you can skip ahead to section 3 and use the merged_representative_sequences.qza and merged_table.qza files for this.
Otherwise, details are given below on how to download the reads for all studies for which data is in the NCBI SRA. Data was provided directly by the authors for Curren2019, Kirstein2018, Kirstein2019 (this is available via NCBI for both Kirstein studies, but I asked for the data without primers removed) and Pollet2018. Frere2018 is available from the VAMPS portal (www.vamps.mbl.edu) under the project name LQM_MPLA_Bv4v5.
Here I use AmaralZettler2015 with accession SRP026054 as an example.
e.g. download from this link
For this you will need the SRA toolkit:
prefetch --option-file SRR_Acc_List.txt
These files by default get added to an ncbi/sra/ folder in your home directory and can then be moved wherever you like.
for i in folder_name/* ; do fastq-dump --split-files --gzip $i ; done
If you are not adding many files then it is probably easier to just do this manually, but get in touch if you want details of how I did this. I renamed all files to follow this format: AZ2015s001_S001_L001_R1_001.fastq.gz AZ2015s001_S001_L001_R2_001.fastq.gz AZ2015s002_S002_L001_R1_001.fastq.gz AZ2015s002_S002_L001_R2_001.fastq.gz
The most current version of this is here. As long as your sample names follow this format (i.e. the sample name comes before the first underscore), the subsequent parts of this analysis shouldn't struggle even if your naming is different.
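As a quick sanity check, the sample-name convention (everything before the first underscore) and the Casava-style filename pattern can be verified in Python. The helper names here are mine, not part of the original pipeline:

```python
import re

def sample_name(filename):
    """Extract the sample name (everything before the first underscore)
    from a Casava-style fastq filename."""
    return filename.split('_')[0]

def is_casava_style(filename):
    """Rough check that a filename follows the
    SampleName_S001_L001_R1_001.fastq.gz pattern."""
    return re.match(r'^[^_]+_S\d+_L\d{3}_R[12]_\d{3}\.fastq\.gz$', filename) is not None

print(sample_name('AZ2015s001_S001_L001_R1_001.fastq.gz'))  # AZ2015s001
```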
I used 12 threads on a server for most of my analyses, but you can change this in these code chunks accordingly if you have fewer available. I think some parts will probably struggle with so many samples if you try to run this locally. This follows the Microbiome Helper tutorial here. You can view all of the summary files (produced by the summarize commands) here.
Activate the QIIME2 environment (if you do not already have this installed, follow the instructions here):
conda activate qiime2-2019.10
Note that this uses Deblur. DADA2 could also be used, but given that we didn’t know whether all samples from each study came from the same sequencing run, we chose the per sample denoising approach of Deblur.
mkdir fastqc_out
fastqc -t 12 raw_data/*.fastq.gz -o fastqc_out
multiqc fastqc_out/
You should now look at the multiqc_report.html to ensure that everything is high enough quality to proceed.
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path raw_data/ \
--output-path reads.qza \
--input-format CasavaOneEightSingleLanePerSampleDirFmt
The primer sequences shown here are for 341F and 802R - these will need changing if you have used different primers:
qiime cutadapt trim-paired \
--i-demultiplexed-sequences reads.qza \
--p-cores 12 \
--p-front-f CCTACGGGNGGCWGCAG \
--p-front-r GACTACHVGGGTATCTAATCC \
--p-discard-untrimmed \
--p-no-indels \
--o-trimmed-sequences reads_trimmed.qza
qiime demux summarize \
--i-data reads_trimmed.qza \
--o-visualization reads_trimmed_summary.qzv
If the reads were already trimmed then just use reads.qza as the input here:
qiime vsearch join-pairs \
--i-demultiplexed-seqs reads_trimmed.qza \
--o-joined-sequences reads_joined.qza
If too many reads were removed then you may need to play around with some of the other options here:
qiime demux summarize \
--i-data reads_joined.qza \
--o-visualization reads_joined_summary.qzv
qiime quality-filter q-score-joined \
--i-demux reads_joined.qza \
--o-filter-stats filt_stats.qza \
--o-filtered-sequences reads_joined_filtered.qza
You should look at the positions where the quality starts to drop below 30 and use these as trim lengths:
qiime demux summarize \
--i-data reads_joined_filtered.qza \
--o-visualization reads_joined_filtered_summary.qzv
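Picking the trim length from the quality profile can be sketched programmatically. The function and the toy quality scores below are hypothetical; in practice you should read the positions off the reads_joined_filtered_summary.qzv visualization:

```python
def choose_trim_length(mean_qualities, threshold=30):
    """Return the number of leading positions whose mean quality stays at or
    above the threshold - a candidate value for --p-trim-length."""
    for position, quality in enumerate(mean_qualities):
        if quality < threshold:
            return position
    return len(mean_qualities)

# toy per-position mean qualities: quality first drops below 30 at position 5
print(choose_trim_length([38, 37, 36, 35, 31, 28, 25]))  # 5
```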
You can remove the --p-left-trim-len option if you don't need to remove any from this end:
qiime deblur denoise-16S \
--i-demultiplexed-seqs reads_joined_filtered.qza \
--p-trim-length 402 \
--p-left-trim-len 0 \
--p-sample-stats \
--p-jobs-to-start 12 \
--p-min-reads 1 \
--output-dir deblur_output_quality
qiime feature-table summarize \
--i-table deblur_output_quality/table.qza \
--o-visualization deblur_table_summary.qzv
mv merged_representative_sequences.qza previous_merged_representative_sequences.qza
mv merged_table.qza previous_merged_table.qza
You will need to replace 'your_folder_name' with the folder that contains your tables to be added (when I did this for all studies, I just added additional --i-tables table_name.qza lines):
qiime feature-table merge \
--i-tables your_folder_name/deblur_output_quality/table.qza \
--i-tables previous_merged_table.qza \
--o-merged-table merged_table.qza
You will again need to replace 'your_folder_name' with the folder that contains your sequences to be added (when I did this for all studies, I just added additional --i-data representative_sequences_name.qza lines):
qiime feature-table merge-seqs \
--i-data your_folder_name/deblur_output_quality/representative_sequences.qza \
--i-data previous_merged_representative_sequences.qza \
--o-merged-data merged_representative_sequences.qza
Now that all of the samples that we are looking at are combined into the merged sequences and table files, we can classify and analyze them.
qiime feature-table summarize \
--i-table merged_table.qza \
--o-visualization merged_table_summary.qzv
qiime feature-classifier classify-sklearn \
--i-reads merged_representative_sequences.qza \
--i-classifier ref_alignments/classifier_silva_132_99_16S.qza \
--p-n-jobs 12 \
--output-dir taxa
As these sequences come from different 16S regions, I downloaded the full-length 16S classifier from here. I used the SILVA 132 classifier, but there is now an updated SILVA version; using the latest classifier can only improve classification accuracy, so I recommend it.
qiime tools export \
--input-path taxa/classification.qza \
--output-path taxa
qiime feature-table filter-features \
--i-table merged_table.qza \
--p-min-frequency 10 \
--p-min-samples 1 \
--o-filtered-table merged_table_filtered.qza
qiime taxa filter-table \
--i-table merged_table_filtered.qza \
--i-taxonomy taxa/classification.qza \
--p-include D_1__ \
--p-exclude mitochondria,chloroplast \
--o-filtered-table merged_table_filtered_contamination.qza
qiime feature-table summarize \
--i-table merged_table_filtered_contamination.qza \
--o-visualization merged_table_filtered_contamination_summary.qzv
Now find out how many features you have as well as the maximum sample depth (this is the “Maximum Frequency” in the “Frequency per sample” section).
qiime diversity alpha-rarefaction \
--i-table merged_table_filtered_contamination.qza \
--p-max-depth 995391 \
--p-steps 20 \
--p-metrics 'observed_otus' \
--o-visualization merged_rarefaction_curves.qzv
qiime feature-table filter-samples \
--i-table merged_table_filtered_contamination.qza \
--p-min-frequency 2000 \
--o-filtered-table merged_table_final.qza
qiime feature-table rarefy \
--i-table merged_table_final.qza \
--p-sampling-depth 2000 \
--o-rarefied-table merged_table_final_rarefied.qza
qiime feature-table filter-seqs \
--i-data merged_representative_sequences.qza \
--i-table merged_table_final_rarefied.qza \
--o-filtered-data representative_sequences_final_rarefied.qza
qiime tools export \
--input-path representative_sequences_final_rarefied.qza \
--output-path exports
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' taxa/taxonomy.tsv
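The sed command above edits only the header line of taxa/taxonomy.tsv so that biom add-metadata accepts it. The same transformation in Python, as a standalone sketch rather than part of the pipeline:

```python
def fix_taxonomy_header(header):
    """Mirror the sed edits on the first line of taxonomy.tsv:
    'Feature' -> '#Feature' and 'Taxon' -> 'taxonomy' (first occurrence each)."""
    return header.replace('Feature', '#Feature', 1).replace('Taxon', 'taxonomy', 1)

print(fix_taxonomy_header('Feature ID\tTaxon\tConfidence'))
```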
qiime tools export \
--input-path merged_table_final_rarefied.qza \
--output-path exports
biom add-metadata \
-i exports/feature-table.biom \
-o exports/feature-table_w_tax.biom \
--observation-metadata-fp taxa/taxonomy.tsv \
--sc-separated taxonomy
biom convert \
-i exports/feature-table_w_tax.biom \
-o exports/feature-table_w_tax.txt \
--to-tsv \
--header-key taxonomy
qiime fragment-insertion sepp \
--i-representative-sequences representative_sequences_final_rarefied.qza \
--i-reference-database ref_alignments/sepp-refs-silva-128.qza \
--o-tree insertion_tree_rarefied.qza \
--o-placements insertion_placements_rarefied.qza \
--p-threads 12
You can download the reference file here. At the time of writing, this still used Silva 128, but I would recommend using an updated version if there is one.
qiime tools export \
--input-path insertion_tree_rarefied.qza \
--output-path exports
for i in exports/* ; do cp $i paper_data_20-04-14/qiime_output/ ; done
These will give some metrics and QIIME2 visualizations that can be viewed on the QIIME2 website, although if you include all of the samples that we have, the website won't cope too well with the >2000 samples. To do these, you will need to upload a metadata file containing all samples; you can use the metadata file that I linked to above for this:
qiime diversity core-metrics-phylogenetic \
--i-table merged_table_final_rarefied.qza \
--i-phylogeny insertion_tree_rarefied.qza \
--p-sampling-depth 2000 \
--m-metadata-file metadata.txt \
--p-n-jobs 12 \
--output-dir diversity
qiime tools export \
--input-path diversity/weighted_unifrac_distance_matrix.qza \
--output-path diversity
mv diversity/distance-matrix.tsv exports/weighted_unifrac.tsv
qiime tools export \
--input-path diversity/unweighted_unifrac_distance_matrix.qza \
--output-path diversity
mv diversity/distance-matrix.tsv exports/unweighted_unifrac.tsv
mkdir test_rarefy
cp merged_table_final.qza test_rarefy/
qiime feature-table filter-seqs \
--i-data merged_representative_sequences.qza \
--i-table merged_table_final.qza \
--o-filtered-data representative_sequences_final.qza
cp representative_sequences_final.qza test_rarefy/
cd test_rarefy
qiime fragment-insertion sepp \
--i-representative-sequences representative_sequences_final.qza \
--i-reference-database /home/robyn/other/ref_alignments/sepp-refs-silva-128.qza \
--o-tree insertion_tree_not_norm.qza \
--o-placements insertion_placements_not_norm.qza \
--p-threads 24
qiime tools export \
--input-path representative_sequences_final.qza \
--output-path exports
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' taxa/taxonomy.tsv
qiime tools export \
--input-path merged_table_final.qza \
--output-path exports
biom add-metadata \
-i exports/feature-table.biom \
-o exports/feature-table_w_tax.biom \
--observation-metadata-fp taxa/taxonomy.tsv \
--sc-separated taxonomy
biom convert \
-i exports/feature-table_w_tax.biom \
-o exports/feature-table_w_tax.txt \
--to-tsv \
--header-key taxonomy
qiime tools export \
--input-path insertion_tree_not_norm.qza \
--output-path exports
Note that we have still removed samples with fewer than 2000 reads from these tables, regardless of the normalisation (or lack thereof) used.
In this study, I use PICRUSt2 to predict the metagenome content of all Plastisphere samples. As the default reference files that PICRUSt2 uses don't contain some of the genes for PET degradation (e.g. PETase), I have added these to the reference database. To do this, I downloaded all genomes that are included in PICRUSt2 (or as many as possible - not quite all are available), made an HMM for each gene of interest (i.e. PETase, tphA, etc.) and then ran this HMM on all PICRUSt2 genomes. I then parsed the output to determine how many copies of these genes each genome has, and added this as a column to the default PICRUSt2 reference file.
The PICRUSt2 genomes will need to be downloaded, decompressed and saved somewhere locally. They can be downloaded from this Figshare file. To decompress:
tar -xf path_to_file/JGI_PICRUSt_genomes.tar.bz2
conda install -c biocore hmmer
conda install biopython
The HMMs that are currently shown in the HMM/ folder were made from the .fasta files in the 'hmms_to_make' folder. To make your own, you can follow these steps.
hmmbuild PETase_DNA.hmm PETase_DNA.sto
(VI) Move the .hmm file to the ‘hmms/’ folder
picrust_seqs = 'JGI_PICRUSt_genomes.fasta'
hmms = os.listdir(os.getcwd()+'/hmms/')
ko = 'ko.txt'
os.system('gunzip '+ko+'.gz')
if not os.path.exists('hmms_out'): #make the output folder if it doesn't already exist
os.mkdir('hmms_out')
ko_data = pd.read_csv(ko, header=0, index_col=0, sep='\t')
for hmm in hmms:
os.system('nhmmer hmms/'+hmm+' '+picrust_seqs+' > hmms_out/'+hmm[:-4]+'.out')
You can open any of the files in the hmms_out folder if you want to check whether you have any hits above the inclusion threshold (and whether this fits what you would have expected).
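A quick standalone check for hits above the inclusion threshold can be written in a few lines. This is a sketch mirroring the marker string and dashed-separator conventions that the parsing code below relies on; in a real run the report text would come from a file in hmms_out:

```python
def hits_above_inclusion(report_text):
    """Return the hit lines that appear in the scores table above the
    '------ inclusion threshold ------' marker of an nhmmer report."""
    hits, in_scores = [], False
    for line in report_text.splitlines():
        if '------ inclusion threshold ------' in line:
            break
        if in_scores:
            if line.strip():
                hits.append(line)
            else:
                break  # a blank line ends the scores table
        elif line.count('-') > 40:  # the long dashed line starts the table
            in_scores = True
    return hits

example = '''Scores for complete hits:
    E-value  score  bias  Sequence  start    end  Description
    ------- ------ -----  --------  -----  -----  -----------
    1.2e-30  105.3   0.0  genome_A     10    950
  ------ inclusion threshold ------
      0.021   12.1   0.0  genome_B      5    400
'''
print(hits_above_inclusion(example))
```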
hmms_out = os.listdir(os.getcwd()+'/hmms_out')
main_dir = os.getcwd()
genomes = list(ko_data.index.values)
genomes = [str(genomes[i]).replace('-cluster', '') for i in range(len(genomes))]
for hmm in hmms_out:
included_genomes = []
with open(main_dir+'/hmms_out/'+hmm, 'r') as f: #note: the 'rU' mode has been removed in newer Python 3 versions
contents = f.read()
row, rows = '', []
for a in range(len(contents)-1):
if contents[a:a+1] == '\n':
if row == ' ------ inclusion threshold ------':
break
rows.append(row)
row = ''
else:
row += contents[a]
after_start, other_count = False, 0
for r in range(len(rows)):
if after_start:
block = 0
this_genome = ''
for b in range(1, len(rows[r])):
if rows[r][b-1] == ' ' and rows[r][b] != ' ':
block += 1
if block == 4 and rows[r][b] != ' ':
this_genome += rows[r][b]
if this_genome != '':
included_genomes.append(this_genome)
count = 0
for a in range(len(rows[r])):
if rows[r][a] == '-':
count += 1
if count > 40:
after_start = True
continue
for a in range(len(included_genomes)):
if included_genomes[a][-11:] == 'Description':
included_genomes[a] = included_genomes[a][:-11]
this_col = []
for g in genomes:
c1 = included_genomes.count(g)
c2 = included_genomes.count(g[:-8])
this_col.append(c1+c2)
ko_data[hmm[:-4]] = this_col
ko_data.to_csv('ko_all.txt', sep='\t')
You can now check the ko_all.txt file: there should be new columns titled with your HMM names, with counts of how many times these genes occur in each of your genomes in the rows. If you want to use these with the rest of the Plastisphere meta-analysis then you should replace the 'ko_all.txt' file in the picrust folders in both of the folders inside the 'all_output_and_recreate' folder (downloaded from here).
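Checking the updated reference table only takes a couple of lines. The toy dictionary below stands in for ko_all.txt, and the gene/genome names are examples, not the real reference data:

```python
# toy stand-in for ko_all.txt rows: genome -> {column: copy number}
ko_all = {
    'genome_A': {'K00001': 1, 'PETase_DNA': 2, 'tphA_DNA': 0},
    'genome_B': {'K00001': 0, 'PETase_DNA': 0, 'tphA_DNA': 1},
}

def carriers_of(gene):
    """Genomes with at least one copy of the given gene column."""
    return sorted(g for g, row in ko_all.items() if row.get(gene, 0) > 0)

print(carriers_of('PETase_DNA'))  # ['genome_A']
```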
This can be used to recreate all of the analyses found in the Plastisphere meta-analysis paper, or alternatively to re-do these analyses while also incorporating additional data.
This goes through all analyses run, but you can download all of the analysis files used and created from this Figshare file. In particular, taking the random_forest and ancom files will make a very big difference to how long this takes to run. If you just want to re-make the figures (possibly with changes) then you can add all files but remove the figures folder.
It is expected that the base directory contains, at a minimum:
The majority of these code chunks contain statements to ensure that they aren't run if the output already exists; if you want to run them anyway, you should change these statements or move the files already created elsewhere. Many steps rely on the output of previous steps, although I have tried to ensure that once run they will save their output so that they won't need to be re-run multiple times (they are often very time- and computationally intensive). The sections marked with an asterisk need running each time you want to run anything (after the initial run they should only take a few seconds), but the others are only needed if you are running the analyses for that section.
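The skip-if-output-exists guards mentioned above follow a simple pattern; this sketch is my own helper illustrating the idea, not a function from the analysis code:

```python
import os

def run_if_missing(output_path, compute):
    """Run compute() only when output_path does not already exist,
    mirroring the guards used throughout these code chunks."""
    if os.path.exists(output_path):
        print('skipping - ' + output_path + ' already exists')
        return False
    compute()
    return True
```

Moving or renaming an existing output file is then enough to force the corresponding chunk to run again.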
Python:
basedir = '/Users/robynwright/Documents/OneDrive/Papers_writing/Plastisphere_Meta-analysis/test_recreate_analyses/recreate_analyses/'
ft_tax, meta_fn, seqs, study_locs, study_dates, map_img = basedir+'qiime_output/feature-table_w_tax.txt', basedir+'metadata.txt', basedir+'qiime_output/dna-sequences.fasta', basedir+'Study_location.csv', basedir+'Study_dates.csv', basedir+'world_map.jpg'
n_jobs, est = 10, 10000
seed = 3 #random seed
fs_title, fs_main, fs_small = 14, 10, 8 #font sizes to use in figures
color_env = {'marine':'#03A498', 'freshwater':'#03B1FC', 'aquatic':'#9A75FC', 'terrestrial':'#FD9A64'} #colors for plotting each environment
label_env = ['Marine', 'Freshwater', 'Aquatic', 'Terrestrial'] #labels for each environment
ext, dpi = '.png', 600 #extension and dots per inch to save figures with
color_source = {'aliphatic':'#5F9EA0', 'biofilm':'#FCA94A', 'other plastic':'#8B008B', 'unknown plastic':'#3593FC', 'planktonic':'#F9E79F', 'blank':'gray', 'early':'#FBC704', 'late':'#900C3F', 'collection':'gray'}
phylo_level_names = {0:'kingdoms', 1:'phyla', 2:'classes', 3:'orders', 4:'families', 5:'genera', 6:'species', 7:'ASVs'}
rename_plots = {'AmaralZettler':'Amaral Zettler $et$ $al$. 2015', 'AriasAndres':'Arias-Andres $et$ $al$. 2018', 'Canada':'Canada $et$ $al$. 2020', 'Curren':'Curren & Leong 2019', 'Delacuvellerie':'Delacuvellerie $et$ $al$. 2019', 'DeTender':'De Tender $et$ $al$. 2015', 'DeTenderB':'De Tender $et$ $al$. 2017', 'DussudHudec':'Dussud, Hudec $et$ $al$. 2018', 'DussudMeistertzheim':'Dussud, Meistertzheim $et$ $al$. 2018', 'ErniCassola':'Erni-Cassola $et$ $al$. 2019', 'Esan':'Esan $et$ $al$. 2019', 'Frere':'Frere $et$ $al$. 2018', 'Hoellein':'Hoellein $et$ $al$. 2014', 'HoelleinB':'Hoellein $et$ $al$. 2017', 'Jiang':'Jiang $et$ $al$. 2018', 'Kesy':'Kesy $et$ $al$. 2019', 'Kirstein':'Kirstein $et$ $al$. 2018', 'KirsteinB':'Kirstein $et$ $al$. 2019', 'McCormick':'McCormick $et$ $al$. 2014', 'McCormickB':'McCormick $et$ $al$. 2016', 'Oberbeckmann':'Oberbeckmann $et$ $al$. 2016', 'OberbeckmannB':'Oberbeckmann $et$ $al$. 2018', 'Ogonowski':'Ogonowski $et$ $al$. 2018', 'Parrish':'Parrish $et$ $al$. 2019', 'Pinto':'Pinto $et$ $al$. 2019', 'Pollet':'Pollet $et$ $al$. 2018', 'Rosato':'Rosato $et$ $al$. 2020', 'Syranidou':'Syranidou ', 'SyranidouPE':'Syranidou $et$ $al$. 2017a', 'SyranidouPS':'Syranidou $et$ $al$. 2017b', 'Tagg':'Tagg $et$ $al$. 2019', 'Woodall':'Woodall $et$ $al$. 2018', 'Wu':'Wu $et$ $al$. 2019', 'Xu':'Xu $et$ $al$. 2019', 'Zhang':'Zhang $et$ $al$. 2019', 'WaterOrSediment':'Water or Sediment', 'LabOrField':'Laboratory or Field', 'IncubationOrCollection':'Incubation or Collection', 'MaterialType':'Material type', 'PlasticTypeSpecific':'Plastic type (specific)', 'PlasticTypeGeneral':'Plastic type (general)', 'DEPTH':'Depth', 'IncubationTime':'Incubation time (specific)', 'IncubationGeneral':'Incubation time (general)', 'PrimerPair':'Primer pair', 'DNAExtraction':'DNA extraction method', 'lab':'Laboratory', 'not_plastic':'Not plastic', 'aged_oxope':'Aged Oxo-PE', 'freeliving':'Free living', 'particleassociated':'Particle associated', 'oxope':'Oxo-PE', 'rinse_pe':'PE rinse water', 'rinse_ps':'PS rinse water', 'rinse_wood':'Wood rinse water', 'bhet':'BHET', 'hdpe':'HDPE', 'ldpe':'LDPE', 'na':'NA', 'pa':'PA', 'pe':'PE', 'pes':'PES', 'pestur':'PESTUR', 'pet':'PET', 'phbv':'PHBV', 'pla':'PLA', 'pp':'PP', 'ps':'PS', 'pvc':'PVC', 'san':'SAN', 'w_pe':'Weathered PE', '10:14':'10:14 light:dark', '12:12':'12:12 light:dark', '16:08':'16:08 light:dark', '27F_519R':'27F-519R', '319F_806R':'319F-806R', '338F_806R':'338F-806R', '341F_785R':'341F-785R', '341F_802R':'341F-802R', '341F_806R':'341F-806R', '515F_806R':'515F-806R', '515FY_926R':'515FY-926R', '518F_926R':'518F-926R', '543F_783R':'543F-783R', '967F_1064R':'967F-1064R', 'B969F_BA1406R':'B969F-BA1406R', 'redextract_sigma':'REDExtract-$N$-AmpTM', 'gentra_puregene':'Gentra Puregene', 'purelink':'PureLink', 'powersoil':'PowerSoil', 'phenol_chloroform':'Phenol-Chloroform', 'powerbiofilm':'PowerBiofilm', 'ultraclean_soil':'UltraClean soil', 'fastdna_soil':'FastDNA soil', 'orders':'Order', 'classes':'Class', 'phyla':'Phylum', 'genera':'Genera', 'families':'Family', 'species':'Species', 'ASVs':'ASV', 'kingdoms':'Kingdom', 'PlasticOnly':'Plastic only', '534f_783r':'534F-783R', 'Phenol_chloroform':'Phenol-chloroform'}
meta_name_ind = {'Study':0, 'Latitude':1, 'Longitude':2, 'Environment':3, 'WaterOrSediment':4, 'LabOrField':5, 'IncubationOrCollection':6, 'Source':7, 'MaterialType':8, 'PlasticTypeSpecific':9, 'PlasticTypeGeneral':10, 'DEPTH':11, 'IncubationTime':12, 'IncubationGeneral':13, 'Temperature':14, 'Salinity':15, 'Light':16, 'Season':17, 'PrimerPair':18, 'DNAExtraction':19, 'PlasticOnly':20}
name_dict = {'La2020':'Latva\n$et$ $al$. 2020', 'AZ2015':'Amaral-Zettler\n$et$ $al$. 2015', 'AA2018':'Arias-Andres\n$et$ $al$. 2018', 'Ca2020':'Canada\n$et$ $al$. 2020', 'Cu2019':'Curren & Leong\n2019', 'De2019':'Delacuvellerie\n$et$ $al$. 2019', 'DT2015':'De Tender\n$et$ $al$. 2015', 'DT2017':'De Tender\n$et$ $al$. 2017', 'DH2018':'Dussud \n$et$ $al$. 2018a', 'DM2018':'Dussud\n$et$ $al$. 2018b', 'EC2019':'Erni-Cassola\n$et$ $al$. 2019', 'Es2019':'Esan\n$et$ $al$. 2019', 'Fr2018':'Frere\n$et$ $al$. 2018', 'Ho2014':'Hoellein\n$et$ $al$. 2014', 'Ho2017':'Hoellein\n$et$ $al$. 2017', 'Ji2018':'Jiang\n$et$ $al$. 2018', 'Ke2019':'Kesy\n$et$ $al$. 2019', 'Ki2018':'Kirstein\n$et$ $al$. 2018', 'Ki2019':'Kirstein\n$et$ $al$. 2019', 'MC2014':'McCormick\n$et$ $al$. 2014', 'MC2016':'McCormick\n$et$ $al$. 2016', 'Ob2016':'Oberbeckmann\n$et$ $al$. 2016', 'Ob2018':'Oberbeckmann\n$et$ $al$. 2018', 'Og2018':'Ogonowski\n$et$ $al$. 2018', 'Pi2019':'Pinto\n$et$ $al$. 2019', 'Po2018':'Pollet\n$et$ $al$. 2018', 'Ro2020':'Rosato\n$et$ $al$. 2020', 'Sy2019':'Syranidou\n$et$ $al$. 2019', 'SyPE20':'Syranidou\n$et$ $al$. 2017a', 'SyPS20':'Syranidou\n$et$ $al$. 2017b', 'Ta2019':'Tagg\n$et$ $al$. 2019', 'Wo2018':'Woodall\n$et$ $al$. 2018', 'Wr2019':'Wright\n$et$ $al$. 2020', 'Wu2019':'Wu\n$et$ $al$. 2019', 'Xu2019':'Xu\n$et$ $al$. 2019', 'Zh2019':'Zhang\n$et$ $al$. 2019', 'Br2016':'Bryant\n$et$ $al$. 2016', 'Pin201':'Pinnell\n$et$ $al$. 2019', 'Pa2019':'Parrish\n$et$ $al$. 2019'}
name_dict_2 = {'La2020':'Latva $et$ $al$. 2020', 'AZ2015':'Amaral-Zettler $et$ $al$. 2015', 'AA2018':'Arias-Andres $et$ $al$. 2018', 'Ca2020':'Canada $et$ $al$. 2020', 'Cu2019':'Curren & Leong 2019', 'De2019':'Delacuvellerie $et$ $al$. 2019', 'DT2015':'De Tender $et$ $al$. 2015', 'DT2017':'De Tender $et$ $al$. 2017', 'DH2018':'Dussud $et$ $al$. 2018a', 'DM2018':'Dussud $et$ $al$. 2018b', 'EC2019':'Erni-Cassola $et$ $al$. 2019', 'Es2019':'Esan $et$ $al$. 2019', 'Fr2018':'Frere $et$ $al$. 2018', 'Ho2014':'Hoellein $et$ $al$. 2014', 'Ho2017':'Hoellein $et$ $al$. 2017', 'Ji2018':'Jiang $et$ $al$. 2018', 'Ke2019':'Kesy $et$ $al$. 2019', 'Ki2018':'Kirstein $et$ $al$. 2018', 'Ki2019':'Kirstein $et$ $al$. 2019', 'MC2014':'McCormick $et$ $al$. 2014', 'MC2016':'McCormick $et$ $al$. 2016', 'Ob2016':'Oberbeckmann $et$ $al$. 2016', 'Ob2018':'Oberbeckmann $et$ $al$. 2018', 'Og2018':'Ogonowski $et$ $al$. 2018', 'Pi2019':'Pinto $et$ $al$. 2019', 'Po2018':'Pollet $et$ $al$. 2018', 'Ro2020':'Rosato $et$ $al$. 2020', 'Sy2019':'Syranidou $et$ $al$. 2019', 'SyPE20':'Syranidou $et$ $al$. 2017a', 'SyPS20':'Syranidou $et$ $al$. 2017b', 'Ta2019':'Tagg $et$ $al$. 2019', 'Wo2018':'Woodall $et$ $al$. 2018', 'Wr2019':'Wright $et$ $al$. 2020', 'Wu2019':'Wu $et$ $al$. 2019', 'Xu2019':'Xu $et$ $al$. 2019', 'Zh2019':'Zhang $et$ $al$. 2019', 'Br2016':'Bryant $et$ $al$. 2016', 'Pin201':'Pinnell $et$ $al$. 2019', 'Pa2019':'Parrish $et$ $al$. 2019'}
name_env = {'AZ2015':'marine', 'AA2018':'freshwater', 'Ca2020':'aquatic', 'Cu2019':'marine', 'De2019':'marine', 'DT2015':'marine', 'DT2017':'marine', 'DH2018':'marine', 'DM2018':'marine', 'EC2019':'marine', 'Es2019':'terrestrial', 'Fr2018':'marine', 'Ho2014':'freshwater', 'Ho2017':'freshwater', 'Ke2019':'aquatic', 'Ki2018':'marine', 'Ki2019':'marine', 'MC2014':'freshwater', 'MC2016':'freshwater', 'Ob2016':'marine', 'Ob2018':'aquatic', 'Og2018':'marine', 'Pi2019':'marine', 'Po2018':'marine', 'Ro2020':'marine', 'Sy2019':'marine', 'SyPE20':'marine', 'SyPS20':'marine', 'Ta2019':'aquatic', 'Wo2018':'marine', 'Wr2019':'marine', 'Wu2019':'freshwater', 'Xu2019':'marine', 'Zh2019':'terrestrial', 'Pa2019':'aquatic'}
def open_csv(fn): #open csv with file name fn, returning the data as rows
'''
Opens csv file and returns the data as rows
Takes as input:
- fn (.csv file name)
Returns:
- rows of the file as a list
'''
with open(fn, 'r') as f: #note: the 'rU' mode has been removed in newer Python 3 versions
rows = []
for row in csv.reader(f):
rows.append(row)
return rows
def open_txt(fn):
'''
Opens text file and returns the data as rows
Takes as input:
- fn (.txt file name - assumes that data is tab delimited and lines terminated with \n)
Returns:
- rows of the file as a list
'''
with open(fn, 'r') as f: #note: the 'rU' mode has been removed in newer Python 3 versions
rows = []
for row in csv.reader(f, delimiter='\t', lineterminator='\n'):
rows.append(row)
return rows
def open_pickle(fn):
'''
Opens python data object file and returns the data as a python object
Takes as input:
- fn (file name for python object)
Returns:
- python object
'''
with open(fn, 'rb') as pick:
pick = pickle.load(pick)
return pick
def write_csv(fn, data):
'''
Save a .csv file of the data given
Takes as input:
- fn (the name with which to save the csv file)
- data (a list of lists - the data to go in the csv file)
'''
with open(fn, 'w') as f:
writer = csv.writer(f)
for row in data:
writer.writerow(row)
return
def write_txt(fn, data):
'''
Save a .txt file (tab delimited with \n to break lines) of the data given
Takes as input:
- fn (the name with which to save the txt file)
- data (a list of lists - the data to go in the txt file)
'''
with open(fn, 'w') as f:
writer = csv.writer(f, delimiter='\t', lineterminator='\n')
for row in data:
writer.writerow(row)
return
def write_pickle(fn, data): #write pickle python object with name fn and the data in data
'''
Write a pickle python object with the data given
Takes as input:
- fn (the name with which to save the data)
- data (the python object to save)
'''
with open(fn, 'wb') as f:
pickle.dump(data, f)
return
def open_and_sort(fn):
'''
This is a function to sort the rows of a .csv file and return the resulting sorted dataframe
Takes as input:
- fn (the name of the .csv file)
If 'unifrac' is in the name then it will sort the columns as well as the rows. Otherwise only the rows will be sorted
Returns:
- df (a pandas dataframe of the sorted .csv file)
'''
df = pd.read_csv(fn, header=0, index_col=0) #open the file
df.sort_index(inplace=True) #sort the row (sample) names
if 'unifrac' in fn: #if it is a unifrac distance matrix
df.sort_index(axis=1, inplace=True) #also sort the columns
return df
def get_meta(meta): #get the information contained in the meta file as a dictionary
'''
Function to get the information contained in the metadata file as a dataframe and a dictionary
Takes as input:
- meta (.txt tab delimited file with sample names as rows and metadata categories as columns)
Returns:
- meta (the metadata file as a list)
- meta_names (the names of the metadata categories)
- meta_dict (a dictionary containing sample names as keys and all metadata stored)
'''
meta = open_txt(meta) #open the file
meta_names = meta[0] #save the column names
del meta[0] #delete the column names
meta_dict = {} #create a dictionary of all information contained, with sample names as dictionary keys
for a in range(len(meta)):
meta_dict[meta[a][0]] = meta[a][1:]
for b in range(len(meta[a])):
if meta[a][b] not in rename_plots:
try:
float(meta[a][b])
except:
rename_plots[meta[a][b]] = meta[a][b].capitalize()
return meta, meta_names, meta_dict
def get_meta_df(meta, meta_names, ft_samples):
'''
Function to check that only the samples that were included in the QIIME analysis (after filtering, rarefaction, etc) remain in the metadata dataframe
Takes as input:
- meta (the metadata file as a list (one of the outputs of the get_meta function)
- meta_names (list of metadata categories (also output from the get_meta function))
- ft_samples (the sample names from the feature table)
Returns:
- meta_df (a dataframe containing only the samples that are in the feature table)
'''
meta_df = pd.DataFrame(meta, columns=meta_names) #open the meta file
meta_df.index = meta_df['#SampleID'] #change the index to be the sample ID column
meta_df.drop('#SampleID', axis=1, inplace=True) #remove this column
sn_meta = list(meta_df.index.values) #get a list of all sample names
for a in sn_meta: #for each of the samples
if a not in ft_samples: #if it's not also in the samples list
meta_df.drop(a, axis=0, inplace=True) #remove it from the meta dataframe
return meta_df
def get_ASV_dict(ft, seqs, basedir):
'''
Function to rename the ASVs with a number (rather than the random sequence of letters/numbers that is standard for QIIME2 output)
Takes as input:
- ft (feature table pandas dataframe with sample names as columns and ASVs as rows)
- seqs (name of the sequences fasta file that contains sequences for all of the ASVs)
- basedir (the base directory in which to save the renamed sequences fasta file)
Returns:
- ASV_dict (a dictionary containing the previous ASV names as keys and the new ASV names as data)
'''
asvs = list(ft.index.values)
ASV_dict = {}
for a in range(len(asvs)):
ASV_dict[asvs[a]] = 'ASV'+str(a).zfill(4)
seqs_rename = [] #create a list to add the sequence records to
for record in SeqIO.parse(seqs, "fasta"): #open the sequences file and loop through it
if record.id in asvs: #if the record id matches one found in the asv list
record.id = ASV_dict[record.id]
seqs_rename.append(record) #add the record to the list
SeqIO.write(seqs_rename, basedir+"sequences_agglom_renamed.fasta", "fasta") #save the new list of sequence records
return ASV_dict
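The renaming scheme itself is just a zero-padded counter over the feature table rows (the feature IDs below are made-up stand-ins for QIIME2's hash-like identifiers):

```python
# Sketch of the ASV renaming in get_ASV_dict: each hash-like QIIME2 feature ID
# (hypothetical ones here) is mapped to a zero-padded sequential name
asvs = ["fe59a1", "0c2b77", "a91d3e"]
ASV_dict = {asv: "ASV" + str(i).zfill(4) for i, asv in enumerate(asvs)}
print(ASV_dict["0c2b77"])  # ASV0001
```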
def mark_inset(parent_axes, inset_axes, loc1a=1, loc1b=1, loc2a=2, loc2b=2, **kwargs):
'''
Custom version of mpl_toolkits.axes_grid1.inset_locator's mark_inset that allows each of the two connector lines to use different corners on the inset and parent axes (loc1a/loc1b and loc2a/loc2b)
'''
rect = TransformedBbox(inset_axes.viewLim, parent_axes.transData)
pp = BboxPatch(rect, fill=False, **kwargs)
parent_axes.add_patch(pp)
p1 = BboxConnector(inset_axes.bbox, rect, loc1=loc1a, loc2=loc1b, **kwargs)
inset_axes.add_patch(p1)
p1.set_clip_on(False)
p2 = BboxConnector(inset_axes.bbox, rect, loc1=loc2a, loc2=loc2b, **kwargs)
inset_axes.add_patch(p2)
p2.set_clip_on(False)
return pp, p1, p2
def transform_for_NMDS(similarities, n_jobs=1): #transform the similarity matrix to 2D space (n_components=2)
'''
For a similarity matrix, calculate the best fit of points in 2D
Takes as input:
- similarities (a distance matrix with no row or column names)
- n_jobs (number of processors to use for constraining - doesn't seem to work so we leave this as the default 1)
'''
similarities = np.nan_to_num(similarities)
X_true = similarities.iloc[0:].values
mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=seed, dissimilarity="precomputed", n_jobs=n_jobs)
pos = mds.fit(similarities).embedding_
nmds = manifold.MDS(n_components=2, metric=False, max_iter=3000, eps=1e-12, dissimilarity="precomputed", random_state=seed, n_jobs=n_jobs, n_init=1)
npos = nmds.fit_transform(similarities, init=pos)
npos *= np.sqrt((X_true ** 2).sum()) / np.sqrt((npos ** 2).sum())
clf = PCA()
npos = clf.fit_transform(npos)
return npos, nmds.stress_
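Non-metric MDS coordinates are only defined up to scale, so after the non-metric fit they are rescaled so that their overall magnitude matches that of the input distances (the `npos *=` line above). A minimal numpy check of just that rescaling step, on toy values:

```python
import numpy as np

# The nMDS solution is scale-free, so it is rescaled such that the sum of
# squared coordinates matches the sum of squared input distances
X_true = np.array([[0.0, 1.0],
                   [1.0, 0.0]])  # toy distance matrix
npos = np.array([[0.1, 0.0],
                 [-0.1, 0.0]])   # toy (unscaled) nMDS coordinates
npos = npos * np.sqrt((X_true ** 2).sum()) / np.sqrt((npos ** 2).sum())
print(round(float((npos ** 2).sum()), 6))  # 2.0, matching (X_true ** 2).sum()
```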
def format_R(ft, basedir):
'''
Function that transforms the feature table output by QIIME2 to a version that can be used by the R agglomerate script (phyloseq and unifrac)
Takes as input:
- ft (path to the feature table .txt file, with samples as columns and ASVs as rows)
- basedir (path of the folder to save the output files to)
Gives as output:
- ft as a pandas dataframe with sorted sample names
- tax_dict - a dictionary containing the full taxonomy for each ASV
- saves various files: feature_table.csv, taxonomy.csv, taxonomy_name_only.csv, random_forest/taxonomy_name_only.csv, tax_dict.dictionary
'''
ft = open_txt(ft) #open the text file
del ft[0] #delete the first line of the file that tells you it was created from a .biom file
tax_dict = {} #create a dictionary that will have all of the taxon information
tax_file = [['OTUID', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species', 'Species name']] #create a list that will have all of the taxon information
tax_name_only = [['OTUID', 'Species name']]
for a in range(len(ft)): #for each row of the feature table
if a > 0: #if it's not the header row
string = ft[a][-1].split(';') #split this taxon information based on semi-colons (the default format from QIIME2)
new_string = []
for b in range(len(string)):
if b < 5 and 'Alteromonas' in string[b]:
break
if 'uncultured' in string[b]:
continue
elif 'Ambiguous' in string[b]:
continue
elif 'Uncultured' in string[b]:
continue
elif 'ambiguous' in string[b]:
continue
else:
if string[b][:3] == ' D_':
string[b] = string[b][6:]
elif string[b][:2] == 'D_':
string[b] = string[b][5:]
new_string.append(string[b])
sp_name = new_string[-1]
while len(new_string) < 7:
new_string.append('')
new_string.append(sp_name)
tax_name_only.append([ft[a][0], sp_name])
tax_dict[ft[a][0]] = new_string #creates the taxon in the taxa dictionary
tax_file.append([ft[a][0]]+new_string) #add this asv name and taxon information
ft[a] = ft[a][:-1] #remove the taxon information from the feature table
ft[0][0] = 'OTUID' #Rename the first cell to match the expected input in R
write_csv(basedir+'feature_table.csv', ft) #write the new feature table to .csv
ft = pd.read_csv(basedir+'feature_table.csv', header=0, index_col=0)
ft.sort_index(inplace=True)
ft.sort_index(axis=1, inplace=True)
write_csv(basedir+'taxonomy.csv', tax_file) #write the taxonomy to .csv
write_csv(basedir+'taxonomy_name_only.csv', tax_name_only) #write the taxonomy to .csv
write_pickle(basedir+'tax_dict.dictionary', tax_dict) #write the taxon information to a python object
return ft, tax_dict
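The taxonomy-cleaning loop in format_R strips the SILVA-style level prefixes from the QIIME2 taxonomy strings. A minimal sketch of just that prefix stripping, on a made-up taxonomy string:

```python
# Sketch of the prefix stripping in format_R: QIIME2/SILVA taxonomy strings
# look like 'D_0__Bacteria; D_1__Proteobacteria; ...' and the D_x__ prefixes
# (with or without a leading space after the semicolon split) are removed
string = "D_0__Bacteria; D_1__Proteobacteria; D_2__Gammaproteobacteria".split(";")
new_string = []
for s in string:
    if s[:3] == " D_":
        s = s[6:]  # drop ' D_x__'
    elif s[:2] == "D_":
        s = s[5:]  # drop 'D_x__'
    new_string.append(s)
print(new_string)  # ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria']
```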
def study_map_metrics(dates, locs, basedir, map_img, meta_df):
'''
Function to plot and save Figure 1 map (cumulative number of studies at the beginning of 2020 and the world map showing all of these studies) and metrics (number of samples included for each environment, plastic type and whether these were from incubations or collections, i.e. Figure 1C, 1D, 1E)
It uses files other than the metadata file so that studies whose data were not accessible can also be included
Takes as input:
- dates (file containing details of the numbers of studies published in each environment in each year, and whether the data for these studies are accessible, named Study_dates.csv in the data to be downloaded for this study)
- locs (file containing details of study locations for all studies published up to 2020, including whether the data was accessible for each, named Study_location.csv in the data to be downloaded for this study)
- basedir (location of the folder to save the figure to)
- map_img (path to the map figure used as the background)
- meta_df (dataframe containing samples as rows and metadata categories as columns)
Returns:
- nothing, but saves figure 'Fig1_map_metrics' to the figures folder
'''
dates = open_csv(dates) #open csv file with study dates
locs = open_csv(locs) #open csv file with study locations
plt.figure(figsize=(15,8)) #set up figure
ax1 = plt.subplot2grid((80,4), (0,0), rowspan=36) #add axis for number of publications
ax2 = plt.subplot2grid((2,4), (0,1), colspan=2) #add axis for map
ax1.set_title('A', loc='left', fontsize=fs_main, fontweight='bold') #add axis label
ax2.set_title('B', loc='left', fontsize=fs_main, fontweight='bold') #add axis label
plt.sca(ax2)
img = plt.imread(map_img) #get the background map picture and set the main and inset axis to show only the regions of interest
ax2.imshow(img, extent=[-180, 180, -90, 90], alpha=0.6)
years, marine, freshwater, terrestrial, aquatic, marine_unav, freshwater_unav, terrestrial_unav, aquatic_unav = [], [], [], [], [], [], [], [], [] #set up lists to add data to
mar, fw, terr, aqu, mar_unav, fw_unav, terr_unav, aqu_unav = 0, 0, 0, 0, 0, 0, 0, 0 #set all numbers to zero to start
for a in range(1, len(dates)): #go through and see if there is data for each cell
for b in range(len(dates[a])):
if dates[a][b] == '': dates[a][b] = 0 #if there's nothing in the cell, make this equal to 0
else: dates[a][b] = int(dates[a][b]) #if there is a number, make this into an integer
years.append(dates[a][0]) #add year to list
#add all cumulative numbers on
mar += dates[a][1]
mar_unav += dates[a][2]
fw += dates[a][3]
fw_unav += dates[a][4]
aqu += dates[a][5]
aqu_unav += dates[a][6]
terr += dates[a][7]
terr_unav += dates[a][8]
#append the new cumulative numbers to the respective lists for each environment
marine.append(mar), marine_unav.append(mar_unav), freshwater.append(fw), freshwater_unav.append(fw_unav), aquatic.append(aqu), aquatic_unav.append(aqu_unav), terrestrial.append(terr), terrestrial_unav.append(terr_unav)
adding = [terrestrial, terrestrial_unav, aquatic, aquatic_unav, freshwater, freshwater_unav, marine, marine_unav]
#now add all of the environments together
for a in range(1, len(adding)):
for b in range(len(adding[a])):
adding[a][b] += adding[a-1][b]
#set up the colors that we will use and the transparency for the studies with no data
colors_2 = [color_env['terrestrial'], color_env['terrestrial'], color_env['aquatic'], color_env['aquatic'], color_env['freshwater'], color_env['freshwater'], color_env['marine'], color_env['marine']]
a1, a2 = 1, 0.3
alphas = [a1, a2, a1, a2, a1, a2, a1, a2]
for a in range(len(adding)): #for each line that we have set up
ax1.plot(years, adding[a], 'grey', lw=1) #plot the line in grey
if a > 0: #and fill between that line and either the previous line or zero
ax1.fill_between(years, adding[a-1], adding[a], facecolor=colors_2[a], alpha=alphas[a])
else:
ax1.fill_between(years, 0, adding[a], facecolor=colors_2[a], alpha=alphas[a])
#show only the parts of the axis that we are interested in
ax1.set_xlim([2012, 2020])
ax1.set_ylim([0, 55])
#add axis labels and change the font sizes to be consistent throughout
ax1.set_ylabel('Cumulative number of\nPlastisphere publications', fontsize=fs_small)
ax1.set_xlabel('Year', fontsize=fs_main)
ax1.set_xticks([2012, 2014, 2016, 2018, 2020])
ax1.tick_params(axis='both', which='major', labelsize=fs_small)
ax1.tick_params(axis='both', which='minor', labelsize=fs_small)
#make a custom legend (this is so we have boxes for colors rather than lines)
marine = mlines.Line2D([], [], color=color_env['marine'], marker='s', markersize=8, markeredgecolor='k', label=label_env[0], linestyle=' ')
freshwater = mlines.Line2D([], [], color=color_env['freshwater'], marker='s', markersize=8, markeredgecolor='k', label=label_env[1], linestyle=' ')
aquatic = mlines.Line2D([], [], color=color_env['aquatic'], marker='s', markersize=8, markeredgecolor='k', label=label_env[2], linestyle=' ')
terrestrial = mlines.Line2D([], [], color=color_env['terrestrial'], marker='s', markersize=8, markeredgecolor='k', label=label_env[3], linestyle=' ')
plt.sca(ax1)
plt.legend(handles=[marine,freshwater,aquatic, terrestrial], loc='upper left', fontsize=fs_small)
#make all locations into floats (rather than the default strings)
for a in range(1, len(locs)):
for b in range(len(locs[a])):
if b == 1 or b == 2 or b == 3:
locs[a][b] = float(locs[a][b])
axins1 = zoomed_inset_axes(ax2, 2.5, loc='upper left', bbox_to_anchor=(0.25, 0.5), bbox_transform=ax2.transAxes)
axins1.imshow(img, extent=[-180, 180, -90, 90], alpha=0.6)
axins1.set_xlim([-22, 30]), axins1.set_ylim([25, 65])
mark_inset(ax2, axins1, loc1a=1, loc1b=4, loc2a=2, loc2b=3, fc="none", ec="k")
axins2 = zoomed_inset_axes(ax2, 2.1, loc='upper right', bbox_to_anchor=(0.99, 0.45), bbox_transform=ax2.transAxes)
axins2.imshow(img, extent=[-180, 180, -90, 90], alpha=0.6)
axins2.set_xlim([105, 130]), axins2.set_ylim([15, 45])
mark_inset(ax2, axins2, loc1a=1, loc1b=4, loc2a=2, loc2b=3, fc="none", ec="k")
for ax in [ax2, axins1, axins2]:
plt.sca(ax)
plt.yticks([]), plt.xticks([])
#plot each location, checking which color and marker these need based on the values in the other columns of the file
for i in range(1, len(locs)):
if locs[i][5] == 'Yes': alpha, edge = 1, 'k'
else: alpha, edge = 1, 'w'
if locs[i][6] == 'Lab': marker='^'
elif locs[i][6] == 'Field': marker = 'o'
elif locs[i][6] == 'Both': marker = '*'
ax2.scatter(locs[i][2], locs[i][1], color=color_env[locs[i][4]], edgecolor=edge, marker=marker, s=15, linewidth=0.4, alpha=alpha)
axins1.scatter(locs[i][2], locs[i][1], color=color_env[locs[i][4]], edgecolor=edge, marker=marker, s=25, linewidth=0.4, alpha=alpha)
axins2.scatter(locs[i][2], locs[i][1], color=color_env[locs[i][4]], edgecolor=edge, marker=marker, s=20, linewidth=0.4, alpha=alpha)
#again add a custom legend
plt.sca(ax2)
lab = mlines.Line2D([], [], color='w', marker='^', markersize=8, markeredgecolor='k', label='Laboratory', linestyle=' ')
field = mlines.Line2D([], [], color='w', marker='o', markersize=8, markeredgecolor='k', label='Field', linestyle=' ')
both = mlines.Line2D([], [], color='w', marker='*', markersize=8, markeredgecolor='k', label='Both', linestyle=' ')
gap = mlines.Line2D([], [], color='w', marker='o', markersize=2, markeredgecolor='w', alpha=0, label='', linestyle=' ')
plt.legend(handles=[marine,freshwater,aquatic, terrestrial, gap, lab, field, both], loc='lower left', fontsize=fs_small)
#set up the axes for the sample metrics plots (panels C, D and E)
ax1 = plt.subplot2grid((2,4), (1,0))
ax2 = plt.subplot2grid((2,4), (1,1))
ax3 = plt.subplot2grid((2,4), (1,2))
samples = list(meta_df.index.values) #get all sample names
envs, env_caps = ['terrestrial', 'aquatic', 'freshwater', 'marine'], ['Terrestrial', 'Aquatic', 'Freshwater', 'Marine'] #set up lists of the environment names
for a in range(len(envs)): #for each of the environments
bottom = 0 #reset the bottom
for b in ['lab', 'field']: #for each of lab and field
count = 0 #reset the count
for c in range(len(samples)): #for each of the sample names
if meta_df.loc[samples[c], 'Environment'] == envs[a] and meta_df.loc[samples[c], 'LabOrField'] == b: #if the metadata shows the sample is in the environment and in the lab/field category we are looking at
count += 1 #add it to the count
if b == 'lab': #if we are looking at lab samples
ax1.barh(a, count, left=bottom, color=color_env[envs[a]], hatch='//', edgecolor='k') #plot a bar with the correct colors and hatching
else: #otherwise
ax1.barh(a, count, left=bottom, color=color_env[envs[a]], edgecolor='k') #use the right color, but not hatched
if count > 100: #if the count is above 100 (i.e. the bar is large enough to show text within)
ax1.text((count/2)+bottom, a, str(count), color='w', ha='center', va='center', fontsize=fs_small) #add text saying the number of samples in this category
bottom = count+bottom #now add this count to the bottom
if b == 'field': #if we are looking at field samples (i.e. we have now plotted both bars for this environment)
ax1.text(bottom+20, a, str(bottom), color='k', ha='left', va='center', fontsize=fs_small) #print the total number of samples in this environment next to the bar
pie, inc = [], [] #set up new lists
for c in range(len(samples)): #for each of the samples
if meta_df.loc[samples[c], 'Environment'] == envs[a]: #if this is the environment we are looking at
pie.append(meta_df.loc[samples[c], 'PlasticTypeGeneral']) #add the plastic type to the list
inc.append(meta_df.loc[samples[c], 'IncubationGeneral']) #add the incubation time to the other list
unique = list(set(pie)) #get all of the unique values for plastic type in this environment
pie_color = [color_source[x] for x in unique] #and get the colors that match each plastic type from the dictionary at the top
pie_count = [pie.count(x) for x in unique] #and count how many samples of each plastic type
pie_count = [(x/sum(pie_count))*100 for x in pie_count] #and get these counts as relative abundance
left=0 #reset the left count
for d in range(len(pie_count)): #for each sample type
ax2.barh(a, pie_count[d], left=left, color=pie_color[d], label=unique[d], edgecolor='k') #plot the stacked bar corresponding to sample type
fc = 'w' #set the fontcolor as white
if unique[d] == 'planktonic' or unique[d] == 'biofilm': #unless it's planktonic or biofilm
fc = 'k' #in which case, the fontcolor should be black
if pie_count[d] > 5: #if this category has more than 5% of samples (i.e. is big enough to add text)
ax2.text((pie_count[d]/2)+left, a, str(int(pie_count[d])), color=fc, ha='center', va='center', fontsize=fs_small) #add text showing the percentage of samples with this plastic type
left += pie_count[d] #now add this number to the left count (to make the bar stacked)
unique = list(set(inc)) #now we are doing the same for incubation time, so get the unique values for incubation time in this environment
if len(unique) == 3: #if all three incubation categories are present
unique = ['early', 'late', 'collection'] #change the order to be this
pie_color = [color_source[x] for x in unique] #and get the colors that match these incubation times
pie_count = [inc.count(x) for x in unique] #count how many samples for each incubation time
pie_count = [(x/sum(pie_count))*100 for x in pie_count] #and convert these counts to relative abundance
left = 0 #reset the left count
for d in range(len(pie_count)): #for each sample type
ax3.barh(a, pie_count[d], left=left, color=pie_color[d], label=unique[d], edgecolor='k') #plot the stacked bar corresponding to sample type
fc = 'w' #set the fontcolor as white
if unique[d] == 'early': #unless it's an early incubation time
fc = 'k' #in which case the fontcolor should be black
if pie_count[d] > 5: #if this accounts for more than 5% of relative abundance (i.e. is big enough to add text)
ax3.text((pie_count[d]/2)+left, a, str(int(pie_count[d])), color=fc, ha='center', va='center', fontsize=fs_small) #add text showing the percentage of samples with this incubation time
left += pie_count[d] #now add the number to the left count
#set x and y labels, ticks, limits, etc. and make patches and legends for each plot
ax1.set_xlabel('Number of samples', fontsize=fs_small)
plt.sca(ax1)
plt.yticks([0, 1, 2, 3], env_caps, fontsize=fs_small)
plt.xticks(fontsize=fs_small)
plt.xlim([0, 1400])
lab = patches.Patch(facecolor='w', edgecolor='k', hatch='//', label='Laboratory')
field = patches.Patch(facecolor='w', edgecolor='k', label='Field')
ax1.legend(handles=[lab, field], fontsize=fs_main, bbox_to_anchor=(0., 0, 1., -.25), loc='upper left', borderaxespad=0., mode='expand', ncol=2)
sources = ['aliphatic', 'other plastic', 'unknown plastic', 'biofilm', 'planktonic', 'blank']
handles = [patches.Patch(facecolor=color_source[x], edgecolor='k', label=x.capitalize()) for x in sources]
ax2.legend(handles=handles, fontsize=fs_small, bbox_to_anchor=(0., 0, 1., -.25), loc='upper left', borderaxespad=0., mode='expand', ncol=2)
plt.sca(ax2)
plt.yticks([])
plt.xticks(fontsize=fs_small)
plt.xlabel('Relative abundance (%)', fontsize=fs_small)
plt.xlim([0, 100])
times = ['early', 'late', 'collection']
handles = [patches.Patch(facecolor=color_source[x], edgecolor='k', label=x.capitalize()) for x in times]
ax3.legend(handles=handles, fontsize=fs_small, bbox_to_anchor=(0., 0, 1., -.25), loc='upper left', borderaxespad=0., mode='expand', ncol=3)
plt.sca(ax3)
plt.yticks([])
plt.xticks(fontsize=fs_small)
plt.xlabel('Relative abundance (%)', fontsize=fs_small)
plt.xlim([0, 100])
plt.subplots_adjust(wspace=0.1)
ax1.set_title('C', loc='left', fontsize=fs_main, fontweight='bold')
ax2.set_title('D', loc='left', fontsize=fs_main, fontweight='bold')
ax3.set_title('E', loc='left', fontsize=fs_main, fontweight='bold')
plt.savefig(basedir+'/figures/'+'Fig1_map_metrics'+ext, dpi=dpi, bbox_inches='tight')
plt.close()
return
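The stacked bars in panels D and E are built from a simple count-to-percentage conversion per environment; the idea, with a made-up subset of the plastic type categories:

```python
# Sketch of the count -> relative abundance conversion behind the stacked
# bars in Fig. 1D/E (the sample categories here are a hypothetical subset)
pie = ["aliphatic", "aliphatic", "biofilm", "unknown plastic"]
unique = sorted(set(pie))                    # unique categories
pie_count = [pie.count(x) for x in unique]   # counts per category
pie_count = [(x / sum(pie_count)) * 100 for x in pie_count]  # as percentages
print(dict(zip(unique, pie_count)))  # {'aliphatic': 50.0, 'biofilm': 25.0, 'unknown plastic': 25.0}
```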
def get_single_nmds(rows, filter_on, filt_ind, color_on, color_ind, ax, leg, colors, names, meta_dict, second_filter='', second_filter_ind='', npos='', n_jobs=1, get_stats=False):
'''
Function to get a single nmds plot on an axis
Takes as input:
- rows (a distance matrix with no row or column names, i.e. containing only numbers)
- filter_on (if we want to filter out samples from the matrix, then this contains what we are filtering on)
- filt_ind (the index in the metadata that we are filtering on)
- color_on (the category that we are coloring samples on)
- color_ind (the index for this category in the metadata)
- ax (the axis we are plotting on)
- leg (the position of the index we are plotting)
- colors (the colors to use for plotting categories)
- names (the sample names, in the same order as the samples occur in rows, the distance matrix)
- meta_dict (a dictionary containing all metadata for each sample, with sample names as keys)
- second_filter (if we want to filter samples based on a second category, then this contains what we are filtering on. Default is '')
- second_filter_ind (and the index in the metadata that we are filtering on. Default is '')
- npos (the nmds positions - this saves calculating the nmds points for a second time if we are plotting the same data again, e.g. just with different colored points)
- n_jobs (the number of processors to use for calculating the nmds plot - this doesn't seem to work if we put more than 1, so isn't actually used)
- get_stats (this tells us whether to get stats based on our categories - e.g. permanova and anosim - default is False)
'''
plt.sca(ax)
color = []
#If we are filtering, go through and save only the samples that are included into new lists
if filter_on != 'none':
new_names, new_ind = [], []
for a in range(len(names)):
if meta_dict[names[a]][filt_ind] == filter_on:
if second_filter != '':
if meta_dict[names[a]][second_filter_ind] != second_filter:
continue
new_names.append(names[a])
new_ind.append(a)
new_rows = []
if new_names == []:
return
for b in range(len(rows)):
if b in new_ind:
this_row = []
for c in range(len(rows[b])):
if c in new_ind:
this_row.append(rows[b][c])
new_rows.append(this_row)
names = new_names
rows = new_rows
#get the appropriate colors for all included samples
groups = []
for a in range(len(names)):
added = False
groups.append(meta_dict[names[a]][color_ind])
for b in range(len(color_on)):
if meta_dict[names[a]][color_ind] == color_on[b]:
color.append(colors[b])
added = True
if not added:
print("Didn't find", meta_dict[names[a]][color_ind], 'in the list of categories to color on, so no color was assigned for this sample')
#if we don't already have the values for npos, then transform the similarity matrix for this
if npos == '':
s = pd.DataFrame(rows)
npos, stress = transform_for_NMDS(s, n_jobs=n_jobs)
else:
s = pd.DataFrame(rows)
if get_stats:
dm = DistanceMatrix(s)
ans = anosim(dm, np.array(groups))
perm = permanova(dm, np.array(groups))
string = ans[0]+': '+ans[1]+'='+str(round(ans[4], 3))+r', $p$='+str(round(ans[5], 3))
string += '\n'+perm[0]+': '+perm[1]+'='+str(round(perm[4], 3))+r', $p$='+str(round(perm[5], 3))
t = plt.text(0.02, 0.98, string, ha='left', va='top', transform = ax.transAxes, fontsize=fs_main, fontweight='bold')
t.set_bbox(dict(facecolor='white', alpha=0.8))
size = 20
#plot the values for nmds 1 (npos[a,0]) and nmds 2 (npos[a,1]) for all samples, giving the color as determined by sample type
for a in range(len(rows[0])):
plt.scatter(npos[a,0], npos[a,1], color=color[a], marker='o', s=size, edgecolor='k')
#get the legend handles in case these are being plotted
handles = []
for a in range(len(color_on)):
if color_on[a] in rename_plots:
this_color = rename_plots[color_on[a]]
else:
this_color = color_on[a]
handles.append(mlines.Line2D([], [], color=colors[a], marker='o', markersize=fs_main, markeredgecolor='k', label=this_color, linestyle=' '))
if filter_on != 'none':
return handles
else:
return npos, handles
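When samples are filtered out, the matching rows and columns of the distance matrix must both be dropped; the symmetric sub-setting can be sketched on a toy 3x3 matrix:

```python
# Sketch of the symmetric sub-setting in get_single_nmds: keep only the rows
# AND columns whose indices are in new_ind (toy values here)
rows = [[0.0, 0.4, 0.8],
        [0.4, 0.0, 0.6],
        [0.8, 0.6, 0.0]]
new_ind = [0, 2]  # indices of the samples that passed the filter
new_rows = [[rows[b][c] for c in new_ind] for b in new_ind]
print(new_rows)  # [[0.0, 0.8], [0.8, 0.0]]
```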
def nmds_plot_study_env(dist_matr_fn_w, dist_matr_fn_uw, meta_dict, basedir, n_jobs=1, save_name='/figures/Fig2_nmds_overall'):
'''
Function to make nmds plots for weighted and unweighted unifrac distances for all samples, either colored by environment or by study
Takes as input:
- dist_matr_fn_w (file name of the weighted unifrac - can be .csv or .txt - assumes this is a file with sample names as rows and columns with values showing their similarity in the matrix)
- dist_matr_fn_uw (file name of the unweighted unifrac - as above)
- meta_dict (dictionary containing metadata for all samples, with sample names as keys)
- basedir (name of the directory to save the figures to)
- n_jobs (number of processors to use for calculating the best ordination - doesn't seem to work so we leave with the default of 1)
Returns:
- nothing, but saves figure 'Fig2_nmds_overall' (or save_name, if given) to the figures folder
'''
plt.figure(figsize=(15,15))
ax1 = plt.subplot2grid((2,2), (0,0))
ax2 = plt.subplot2grid((2,2), (0,1))
ax3 = plt.subplot2grid((2,2), (1,0))
ax4 = plt.subplot2grid((2,2), (1,1))
dist = [dist_matr_fn_w, dist_matr_fn_uw]
ax = [[ax1, ax2], [ax3, ax4]]
envs, env_index = ['marine', 'freshwater', 'aquatic', 'terrestrial'], 3
color_env = ['#03A498', '#03B1FC', '#9A75FC', '#FD9A64']
study, study_index = ['AmaralZettler','AriasAndres','Canada','Curren','Delacuvellerie','DeTender','DeTenderB','DussudHudec','DussudMeistertzheim','ErniCassola','Esan','Frere','Hoellein','HoelleinB','Jiang','Kesy','Kirstein','KirsteinB','McCormick','McCormickB','Oberbeckmann','OberbeckmannB','Ogonowski','Parrish', 'Pinto','Pollet','Rosato','Syranidou','SyranidouPE','SyranidouPS','Tagg','Woodall','Wu','Xu','Zhang'], 0
colormap_40b, colormap_40c, colormap_40a = mpl.cm.get_cmap('tab20b', 256), mpl.cm.get_cmap('tab20c', 256), mpl.cm.get_cmap('tab20', 256)
norm, norm2, norm3 = mpl.colors.Normalize(vmin=0, vmax=19), mpl.colors.Normalize(vmin=20, vmax=39), mpl.colors.Normalize(vmin=40, vmax=59)
m, m2, m3 = mpl.cm.ScalarMappable(norm=norm, cmap=colormap_40b), mpl.cm.ScalarMappable(norm=norm2, cmap=colormap_40c), mpl.cm.ScalarMappable(norm=norm3, cmap=colormap_40a)
for z in range(len(dist)):
dist_matr = pd.read_csv(basedir+dist[z], header=0, index_col=0) #read in the distance matrix
dist_matr = dist_matr.astype('float').fillna(value=0) #turn all values into floats
names = list(dist_matr.columns) #get a list of column names
dist_matr = dist_matr.rename_axis('ID').values #and turn it into a matrix with no row or column names
color_study = []
for a in range(len(study)):
if a < 20:
color_study.append(m.to_rgba(a))
elif a < 40:
color_study.append(m2.to_rgba(a))
else:
color_study.append(m3.to_rgba(a))
filter_on, filter_index = 'none', 'none'
second_filter, second_filter_ind = '', ''
npos, handles = get_single_nmds(dist_matr, filter_on, filter_index, envs, env_index, ax[z][0], 'upper left', color_env, names, meta_dict, second_filter, second_filter_ind, '', n_jobs=n_jobs, get_stats=True)
if z == 0:
plt.sca(ax[z][0])
plt.legend(handles=handles, loc='upper right', fontsize=fs_main, edgecolor='k')
else:
plt.sca(ax[z][0])
plt.legend(handles=handles, loc='lower left', fontsize=fs_main, edgecolor='k')
npos, handles = get_single_nmds(dist_matr, filter_on, filter_index, study, study_index, ax[z][1], 'upper right', color_study, names, meta_dict, second_filter, second_filter_ind, npos, n_jobs=n_jobs, get_stats=True)
if z == 0:
plt.sca(ax[z][1])
plt.legend(handles=handles, bbox_to_anchor=(1.05,1.025), fontsize=fs_main, edgecolor='k')
titles = [r'$\bf{Environment}$', r'$\bf{Study}$', '', '']
title_letter = ['A', 'B', 'C', 'D']
axes = [ax1, ax2, ax3, ax4]
ylabs, xlabs = [r'$\bf{Weighted}$ $\bf{unifrac}$'+'\nnMDS2', '', r'$\bf{Unweighted}$ $\bf{unifrac}$'+'\nnMDS2', ''], ['', '', 'nMDS1', 'nMDS1']
for a in range(len(titles)):
axes[a].set_title(titles[a], fontsize=fs_title)
axes[a].set_title(title_letter[a], fontsize=fs_title, loc='left')
axes[a].set_xlabel(xlabs[a], fontsize=fs_title)
axes[a].set_ylabel(ylabs[a], fontsize=fs_title)
plt.savefig(basedir+save_name+ext, dpi=dpi, bbox_inches='tight')
plt.close()
return
def plot_box(ax, l, r, b, t, line_col):
'''
Function to plot a box on a given axes
Takes as input:
- ax (the axis to plot on)
- l, r, b, t (left, right, bottom and top coordinates)
- line_col (the color to use to plot the line)
'''
plt.sca(ax)
plt.plot([l, r], [b, b], color=line_col, lw=2)
plt.plot([l, r], [t, t], color=line_col, lw=2)
plt.plot([l, l], [t, b], color=line_col, lw=2)
plt.plot([r, r], [t, b], color=line_col, lw=2)
return
def similarity_heatmap_combined(dist_matr_fn_w, dist_matr_fn_uw, basedir, save_name='/figures/Fig3_unifrac_heatmap_combined'):
'''
Function to plot the heatmap showing both weighted and unweighted unifrac distances for all studies
Takes as input:
- dist_matr_fn_w (file name of the weighted unifrac - can be .csv or .txt - assumes this is a file with sample names as rows and columns with values showing their similarity in the matrix)
- dist_matr_fn_uw (file name of the unweighted unifrac - as above)
- basedir (name of the directory to save the figures to)
Returns:
- nothing, but saves figure 'Fig3_unifrac_heatmap_combined' (or save_name, if given) to the figures folder
'''
dist_matr_files = [dist_matr_fn_w, dist_matr_fn_uw]
fig = plt.figure(figsize=(12,12))
ax1 = plt.subplot(211)
ax2 = plt.subplot(212)
ax = [ax1, ax2]
for z in range(len(dist_matr_files)):
plt.sca(ax[z])
dist_matr = dist_matr_files[z]
dist_matr = dist_matr.reset_index().drop('index', axis=1)
dist_matr = [list(dist_matr.columns)]+dist_matr.values.tolist()
studies, first = [], [] #set up lists to store the names of all studies as well as the index for when these start within the distance matrices
for a in range(len(dist_matr[0])): #loop through the study names
if dist_matr[0][a] == '': #if the cell is empty, continue
continue
if dist_matr[0][a][:6] not in studies: #if the study part of the name (excluding sample number) isn't in the list, add it
studies.append(dist_matr[0][a][:6])
first.append(a) #and also add the index for where this study starts
marine, freshwater, aquatic, terrestrial = [], [], [], [] #set up lists to store which samples are from which environment
for a in range(len(studies)): #loop through all study names
#for each, get the environment from the dictionary at the top
#append the study name to the correct environment if it matches one of the environment types
if name_env[studies[a]] == 'marine':
marine.append(a)
elif name_env[studies[a]] == 'freshwater':
freshwater.append(a)
elif name_env[studies[a]] == 'aquatic':
aquatic.append(a)
elif name_env[studies[a]] == 'terrestrial':
terrestrial.append(a)
order = marine+aquatic+freshwater+terrestrial #the order for the studies to be plotted (so that environments are grouped together)
lens = [len(marine), len(aquatic), len(freshwater), len(terrestrial)] #get the length (/number of studies) of each of the environments
text_cols = [] #set up a list for the text colors to be added to so these match the environment type
for a in range(len(marine)):
text_cols.append(color_env['marine'])
for a in range(len(aquatic)):
text_cols.append(color_env['aquatic'])
for a in range(len(freshwater)):
text_cols.append(color_env['freshwater'])
for a in range(len(terrestrial)):
text_cols.append(color_env['terrestrial'])
new_text_cols = [] #now rearrange them so they are in the order they will be plotted
for a in range(len(order)):
for b in range(len(order)):
if a == order[b]:
new_text_cols.append(text_cols[b])
each_study, each_row = [], []
#go through the distance matrix and calculate means for the distance between samples within and between different studies
for a in range(1, len(dist_matr)):
this_num, this_row = [], []
for b in range(1, len(dist_matr[a])):
if b in first or b == len(dist_matr[a])-1:
if b == len(dist_matr[a])-1:
this_num.append(float(dist_matr[a][b]))
if this_num == []:
continue
this_row.append(np.mean(this_num))
this_num = []
else:
if float(dist_matr[a][b]) != 0:
this_num.append(float(dist_matr[a][b]))
if a in first or a == len(dist_matr)-1:
if a == len(dist_matr)-1:
each_row.append(this_row)
if each_row == []:
continue
this_study = []
for c in range(len(each_row[0])):
this_num = []
for d in range(len(each_row)):
this_num.append(each_row[d][c])
this_study.append(np.mean(this_num))
each_study.append(this_study)
each_row = []
actual_row = []
for d in range(len(order)):
actual_row.append(this_row[order[d]])
each_row.append(actual_row)
#now change the order of the studies so that they are grouped by environment
each_study_new, studies_new = [], []
for a in range(len(order)):
each_study_new.append(each_study[order[a]])
studies_new.append(studies[order[a]])
#reverse these as they will be plotted from the bottom not the top
each_study_new.reverse()
studies_new.reverse()
each_study, studies = each_study_new, studies_new
#get the minimum and maximum distances between studies so that colors can be normalised to this range
colormap = mpl.cm.get_cmap('inferno_r', 256)
ma, mi = [], []
for a in range(len(each_study)):
ma.append(max(each_study[a]))
mi.append(min(each_study[a]))
norm = mpl.colors.Normalize(vmin=min(mi), vmax=max(ma))
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
#set up figure and axes
ally = []
#now go through and get the equivalent color for each mean distance and plot these as a bar plot
means = []
for a in range(len(each_study)):
colors = []
x, y, bottom = [], [], []
mean = []
for b in range(len(each_study[a])):
if b != a:
mean.append(each_study[a][b])
color = m.to_rgba(each_study[a][b])
colors.append(color)
x.append(b+1)
y.append(1)
bottom.append(a)
plt.bar(b+1, 1, bottom=a, color=color, edgecolor='k', width=1)
means.append(np.mean(mean))
ally.append(a+0.5)
cb = plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=colormap),ax=ax[z], shrink=0.4, pad=0.02, orientation='vertical')
cb.ax.tick_params(labelsize=fs_small)
if z == 0:
cb.set_label('Weighted\nunifrac distance', fontsize=fs_main)
il = 'Weighted unifrac distance'
else:
cb.set_label('Unweighted\nunifrac distance', fontsize=fs_main)
il = 'Unweighted unifrac distance'
#change x and y limits
plt.xlim([0.5,ally[-1]+1])
plt.ylim([0,ally[-1]+0.5])
colors = ['#03A498', '#03B1FC', '#9A75FC', '#FD9A64']
line_col = 'w'
#plot white boxes around samples of the same environments
l, r, b, t = 0.54, lens[0]+0.43, ally[-1]-lens[0]+0.57, ally[-1]+0.48#left, right, bottom, top
plot_box(ax[z], l, r, b, t, line_col)
l, r, b, t = l+lens[0], r+lens[1], b-lens[1], t-lens[0]
plot_box(ax[z], l, r, b, t, line_col)
l, r, b, t = l+lens[1], r+lens[2], b-lens[2], t-lens[1]
plot_box(ax[z], l, r, b, t, line_col)
l, r, b, t = l+lens[2], r+lens[3], b-lens[3], t-lens[2]
plot_box(ax[z], l, r, b, t, line_col)
for a in range(len(studies)): #rename the studies to a full study name rather than the short code they were given
studies[a] = name_dict_2[studies[a]]
new_means = pd.DataFrame([means], columns=studies, index=[il])
if z == 0:
means_df = new_means
else:
means_df = pd.concat([means_df, new_means])
write_pickle(basedir+'/mean_unifrac.dataframe', means_df)
plt.sca(ax[z])
empty = []
text_cols.reverse()
for a in range(len(ally)): #plot the study names with the correct colors on the y axis
plt.text(0, ally[a], studies[a], color=text_cols[a], va='center', ha='right', fontsize=fs_small)
empty.append('')
plt.yticks(ally, empty, fontsize=fs_small) #remove the y ticks
if z == 0:
studies.reverse()
for a in range(len(x)): #now plot the study names on the x axis (the list is reversed as we plotted from the bottom up)
plt.text(x[a], ally[-1]+1.5, studies[a], rotation=90, color=text_cols[-(a+1)], va='bottom', ha='center', fontsize=fs_small)
plt.xticks(x, empty, fontsize=fs_small) #remove the x ticks
ax[z].xaxis.tick_top()
else:
plt.xticks([])
plt.subplots_adjust(hspace=0.05)
plt.savefig(basedir+save_name+ext, dpi=dpi, bbox_inches='tight')
plt.close()
return
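The colour mapping used throughout this plotting function - a Normalize over the observed value range fed into a ScalarMappable - is a general matplotlib pattern worth isolating. A minimal standalone sketch (the distance values here are made up):

```python
import matplotlib as mpl

# Hypothetical mean distances to map onto the 'inferno_r' colormap,
# normalised to the observed min/max as in the heatmap above
values = [0.1, 0.4, 0.9]
norm = mpl.colors.Normalize(vmin=min(values), vmax=max(values))
mappable = mpl.cm.ScalarMappable(norm=norm, cmap='inferno_r')
colors = [mappable.to_rgba(v) for v in values]  # RGBA tuples with components in [0, 1]
```

The same mappable object can be passed straight to plt.colorbar, which is how the colour scale legend is drawn in the function above.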
def get_venn_labels(data):
'''
Calculate the number of overlapping ASVs between different samples
It takes as input:
- data (which should be a list of lists, containing the taxa present in different samples)
'''
N = len(data)
sets_data = [set(data[i]) for i in range(N)]
s_all = set(chain(*data))
set_collections = {}
for n in range(1, 2**N):
key = bin(n).split('0b')[-1].zfill(N)
value = s_all
sets_for_intersection = [sets_data[i] for i in range(N) if key[i] == '1']
sets_for_difference = [sets_data[i] for i in range(N) if key[i] == '0']
for s in sets_for_intersection:
value = value & s
for s in sets_for_difference:
value = value - s
set_collections[key] = value
labels = []
for k in set_collections:
labels.append(str(len(set_collections[k])))
return labels
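The binary-key bookkeeping in get_venn_labels can be illustrated on a toy example: each region of the venn diagram corresponds to one binary string over the samples, where a '1' at position i means the region lies inside sample i's set. A minimal sketch of the same logic, using hypothetical ASV names:

```python
from itertools import chain

def venn_region_sizes(data):
    """Return {binary_key: count} of taxa in each venn region.

    A key like '10' means 'in the first set but not the second'.
    """
    n = len(data)
    sets_data = [set(d) for d in data]
    s_all = set(chain(*data))
    sizes = {}
    for i in range(1, 2 ** n):
        key = bin(i)[2:].zfill(n)
        region = set(s_all)
        for j in range(n):
            if key[j] == '1':
                region &= sets_data[j]   # must be in this sample's set
            else:
                region -= sets_data[j]   # must not be in this sample's set
        sizes[key] = len(region)
    return sizes

# Two hypothetical samples sharing one ASV
sizes = venn_region_sizes([['ASV1', 'ASV2'], ['ASV2', 'ASV3']])
# '10' -> only sample 1, '01' -> only sample 2, '11' -> shared
```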
def get_venn(labels, ax, colors, names=['A', 'B', 'C', 'D', 'E'], plotting=[True, True, True, True, True]):
'''
This is a function to get a venn diagram with up to 5 ellipses (only plotted if that sample type is present, but the layout remains the same)
It takes as input:
- labels (a list of labels as output by the get_venn_labels function)
- ax (axis to plot the venn diagram on)
- colors (list of colors to use for each ellipse, expected that this will have length 5)
- names (list of names to use for each ellipse, expected that this will have length 5. Default=['A', 'B', 'C', 'D', 'E'])
- plotting (list of True/False for whether to plot each ellipse, expected that this will have length 5. Default is plotting=[True, True, True, True, True])
'''
#list of locations for each point of the ellipses
el1, el2, el3, el4, el5 = [0.428, 0.469, 0.558, 0.578, 0.489], [0.449, 0.543, 0.523, 0.432, 0.383], [0.87, 0.87, 0.87, 0.87, 0.87], [0.50, 0.50, 0.50, 0.50, 0.50], [155.0, 82.0, 10.0, 118.0, 46.0]
for a in range(len(el1)): #plot all ellipses
if plotting[a]: #only if this value is true for each
e = patches.Ellipse(xy=(el1[a], el2[a]), width=el3[a], height=el4[a], angle=el5[a], color=colors[a], alpha=0.4)
ax.add_patch(e)
#list of text locations
x = [0.27, 0.72, 0.55, 0.91, 0.78, 0.84, 0.76, 0.51, 0.39, 0.42, 0.50, 0.67, 0.70, 0.51, 0.64, 0.10, 0.20, 0.76, 0.65, 0.18, 0.21, 0.81, 0.74, 0.27, 0.34, 0.33, 0.51, 0.25, 0.28, 0.36, 0.51]
y = [0.11, 0.11, 0.13, 0.58, 0.64, 0.41, 0.55, 0.90, 0.15, 0.78, 0.15, 0.76, 0.71, 0.74, 0.67, 0.61, 0.31, 0.25, 0.23, 0.50, 0.37, 0.37, 0.40, 0.70, 0.25, 0.72, 0.22, 0.58, 0.39, 0.66, 0.47]
for a in range(len(x)):
if labels[a] != '0': #only plot the overlap number if it is bigger than 0 (it will probably be 0 only if that sample type wasn't present and we aren't plotting the ellipse anyway)
ax.text(x[a], y[a], labels[a], horizontalalignment='center',verticalalignment='center',fontsize=fs_small,color="black")
#list of text locations
x, y = [0.02, 0.72, 0.97, 0.88, 0.12], [0.72, 0.94, 0.74, 0.05, 0.05]
for a in range(len(x)):
if plotting[a]: #add the sample type name if the value for plotting is true
ax.text(x[a], y[a], names[a], horizontalalignment='center',verticalalignment='center',fontsize=fs_main,color="black")
return
def get_diversity(diversity, sample):
'''
function to calculate a range of different diversity metrics
It takes as input:
- diversity (the name of the diversity metric we want; one of 'Simpsons', 'Shannon', 'Richness', 'Evenness' or 'Maximum'. 'Maximum' is not a diversity metric - the function will just return the maximum abundance in the sample as a percentage of the total)
- sample (a list of abundance values that should correspond to one sample)
Returns:
- The diversity index for the individual sample
'''
for a in range(len(sample)):
sample[a] = float(sample[a])
total = sum(sample)
if diversity == 'Simpsons':
for b in range(len(sample)):
sample[b] = (sample[b]/total)**2
simpsons = 1-(sum(sample))
return simpsons
elif diversity == 'Shannon':
for b in range(len(sample)):
sample[b] = (sample[b]/total)
if sample[b] != 0:
sample[b] = -(sample[b] * (np.log(sample[b])))
shannon = sum(sample)
return shannon
elif diversity == 'Richness':
rich = 0
for b in range(len(sample)):
if sample[b] != 0:
rich += 1
return rich
elif diversity == 'Evenness':
for b in range(len(sample)):
sample[b] = (sample[b]/total)
if sample[b] != 0:
sample[b] = -(sample[b] * (np.log(sample[b])))
shannon = sum(sample)
rich = 0
for b in range(len(sample)):
if sample[b] != 0:
rich += 1
even = shannon/(np.log(rich))
return even
elif diversity == 'Maximum':
ma = (max(sample)/total)*100
return ma
return
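For reference, the same indices can be computed compactly on a toy sample. This sketch is independent of the function above and uses only the standard library:

```python
import math

def diversity_indices(counts):
    """Compute Simpson's index of diversity, Shannon entropy, richness and
    Pielou's evenness for one sample's abundance values."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    simpson = 1 - sum(p ** 2 for p in props)           # Simpson's index of diversity
    shannon = -sum(p * math.log(p) for p in props)     # Shannon entropy (natural log)
    richness = len(props)                              # number of taxa present
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return simpson, shannon, richness, evenness

# Four equally abundant taxa: maximum evenness
simpson, shannon, richness, evenness = diversity_indices([5, 5, 5, 5])
# simpson = 0.75, shannon = ln(4), richness = 4, evenness = 1.0
```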
def get_cols(num):
'''
This is a function to get a list of either 20 or 40 colors depending on the input 'num' (assumed that this is a number between 0 and 40)
'''
colormap_20 = mpl.cm.get_cmap('tab20', 256)
norm = mpl.colors.Normalize(vmin=0, vmax=19)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap_20)
colors_20 = []
for a in range(20):
colors_20.append(m.to_rgba(a))
colormap_40b = mpl.cm.get_cmap('tab20b', 256)
norm = mpl.colors.Normalize(vmin=0, vmax=19)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap_40b)
colors_40 = []
for a in range(20):
colors_40.append(m.to_rgba(a))
colormap_40c = mpl.cm.get_cmap('tab20c', 256)
norm = mpl.colors.Normalize(vmin=20, vmax=39)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap_40c)
for a in range(20):
a += 20
colors_40.append(m.to_rgba(a))
if num < 21:
return colors_20
else:
return colors_40
def bar_dendro_venn(ft, ft_full, meta_dict, basedir, tax_dict):
'''
Function to make the dendrogram, stacked bar, heatmap, simpsons diversity and venn diagrams of shared ASVs
Takes as input:
- ft (dataframe with samples as columns and ASVs as rows)
- ft_full (file name of the full, non-agglomerated feature table .csv, with samples as columns and ASVs as rows)
- meta_dict (dictionary containing metadata for all samples, with sample names as keys)
- basedir (name of the directory to save the figures to)
- tax_dict (dictionary containing taxonomy information with ASV names as keys)
Returns:
- nothing, but saves figure 'dendro_venn' to the figures folder
'''
#set up the figure and all axes
fig = plt.figure(figsize=(12,19))
ax1 = plt.subplot2grid((2,7), (0,0), frameon=False)
ax2 = plt.subplot2grid((2,7), (0,2), colspan=3)
ax3 = plt.subplot2grid((2,7), (0,5))
ax4 = plt.subplot2grid((2,7), (0,6))
ax3_colbar = plt.subplot2grid((50, 7), (24,6))
ax5 = plt.subplot2grid((27,2), (16,0), rowspan=4, frameon=False)
ax6 = plt.subplot2grid((27,2), (16,1), rowspan=4, frameon=False)
ax7 = plt.subplot2grid((27,2), (21,0), rowspan=4, frameon=False)
ax8 = plt.subplot2grid((27,2), (21,1), rowspan=4, frameon=False)
ft_full = pd.read_csv(ft_full, header=0, index_col=0)
ft_full = ft_full/20
ft = ft*100 #convert the feature table to % relative abundance
colnames = list(ft.columns) #get the column names (sample names)
env, source, env_source_dict, env_source = [], [], {}, []
for a in range(len(colnames)): #use the sample names to get the metadata categories that these samples belong to (environment and source)
env.append(meta_dict[colnames[a]][3])
source.append(meta_dict[colnames[a]][10])
if meta_dict[colnames[a]][5] == 'lab':
meta_dict[colnames[a]][5] = 'laboratory'
env_source_dict[colnames[a]] = meta_dict[colnames[a]][3]+' '+'\n'+meta_dict[colnames[a]][10]
env_source.append(meta_dict[colnames[a]][3]+' '+'\n'+meta_dict[colnames[a]][10])
envs, sources = ['marine', 'freshwater', 'aquatic', 'terrestrial'], ['aliphatic', 'other plastic', 'unknown plastic', 'biofilm', 'planktonic']
colors = [color_source[sources[0]], color_source[sources[1]], color_source[sources[2]], color_source[sources[3]], color_source[sources[4]]]
venn_ax = [ax5, ax6, ax7, ax8]
#sort the data and make venn diagrams with ASVs that overlap between treatments
for a in range(len(envs)):
asv_present, all_unique_asv, plotting = [], [], []
names = []
for b in range(len(sources)):
keeping = []
for c in range(len(env)):
if env[c] == envs[a] and source[c] == sources[b]:
keeping.append(True)
else:
keeping.append(False)
source_ft = ft.loc[:, keeping] #this should now only keep the samples that belong to envs[a] and sources[b]
ma = list(source_ft.max(axis=1))
all_asv = list(source_ft.index.values) #get all ASVs
asvs_here = []
#only keep the ASVs if they are present (i.e. have abundance above 0)
for d in range(len(all_asv)):
if np.isnan(ma[d]): #note: a float NaN never compares equal to the string 'nan'
continue
if ma[d] > 0:
asvs_here.append(all_asv[d])
asv_present.append(asvs_here)
all_unique_asv += asvs_here
#tell it to only plot if this treatment exists
if len(asvs_here) == 0:
names.append(sources[b].capitalize())
plotting.append(False)
else:
names.append(sources[b].capitalize())
plotting.append(True)
all_unique_asv = set(all_unique_asv) #get all ASVs that are present in this environment
labels = get_venn_labels(asv_present) #get the labels based on which ASVs are present in each treatment (source)
get_venn(labels, venn_ax[a], colors, names=names, plotting=plotting) #get the venn diagram for this environment
venn_ax[a].set_title(envs[a].capitalize()+'\nTotal ASVs = '+str(len(all_unique_asv)), fontweight="bold", fontsize=fs_main) #change the title for this axes
plt.sca(venn_ax[a])
plt.xticks([]), plt.yticks([]) #remove the x and y ticks
#get simpsons diversity for each sample
indiv_div = ['Simpsons']
for cn in colnames:
indiv_div.append(get_diversity('Simpsons', list(ft_full.loc[:, cn])))
#rename all sample names by a combination of the environment and source
ft_full.rename(columns=env_source_dict, inplace=True)
env_source = list(set(env_source)) #now get a list of all unique environments and sources
ft_full = ft_full.reset_index()
ft_full.rename(columns={'OTUID':'index'}, inplace=True)
ft_full = pd.concat([ft_full, pd.DataFrame([indiv_div], columns=ft_full.columns)], ignore_index=True) #add the simpsons diversity to the overall dataframe (DataFrame.append was removed in pandas 2.0)
ft_full.index = ft_full['index']
ft_full.drop('index', axis=1, inplace=True)
ft_dendro = ft_full.drop('Simpsons', axis=0) #make a new dataframe without the simpsons index of diversity
ft_dendro = ft_dendro.groupby(by=ft_dendro.columns, axis=1).mean() #group this by the environment/source name, taking a mean rather than a sum for each ASV
ft_dendro.to_csv('grouped_samples_for_unifrac.csv')
r.r_unifrac() #run the R function to perform unifrac on these groupings
w_uf = pd.read_csv(basedir+'weighted_unifrac_grouped_samples.csv', header=0, index_col=0)
snames = list(w_uf.columns)
r_name_dict = {}
for s in snames:
r_name_dict[s] = s.replace('..', ' \n').replace('.', ' ')
w_uf.rename(columns=r_name_dict, index=r_name_dict, inplace=True)
plt.sca(ax1)
#ft_dendro_T = ft_dendro.transpose() #transpose the dataframe
Z = ssd.squareform(w_uf)
Z = hierarchy.linkage(Z, "ward")
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='left') #plot this dendrogram of sample groupings
y_labels, locs, xlocs, labels = list(ax1.get_yticklabels()), list(ax1.get_yticks()), list(ax1.get_xticks()), [] #get all labels so that we can work out which order samples are plotted in (and apply this to the other plots)
for y in y_labels:
labels.append(y.get_text())
plot_labels, pl, colors = [], [], []
grouping = list(w_uf.columns)
sources.append('blank')
#get the appropriate color and label for the order of samples
for a in range(len(labels)):
for b in range(len(envs)):
if envs[b] in grouping[int(labels[a])-1]:
colors.append(color_env[envs[b]])
plot_labels.append(grouping[int(labels[a])-1].capitalize())
pl.append(grouping[int(labels[a])-1])
#add the yticks (with color corresponding to the environment that they came from)
plt.yticks([])
for a in range(len(locs)):
ax1.text(xlocs[0]-1.5, locs[a], plot_labels[a], fontsize=fs_main, bbox=dict(facecolor=colors[a], alpha=0.4, pad=0.5, edgecolor='w'), ha='center', va='center')
plt.sca(ax1)
plt.xticks([])
asv = list(ft_dendro.index.values)
phyla, clss, spec = {}, {}, {}
#for all ASVs, get the phylum, class and species that they belong to (although we don't actually use the species information)
for a in range(len(asv)):
tax = tax_dict[asv[a]]
phyla[asv[a]] = tax[1]
clss[asv[a]] = tax[3]
spec[asv[a]] = tax[-1]
#make new feature tables where ASVs are named by phyla/class and then group all of these phylums/classes (summing rather than taking means now)
ft_T = ft_dendro.transpose()
ft_phyla = ft_T.rename(columns=phyla)
ft_phyla = ft_phyla.groupby(by=ft_phyla.columns, axis=1).sum()
ft_class = ft_T.rename(columns=clss)
ft_class = ft_class.groupby(by=ft_class.columns, axis=1).sum()
ft_phyla = ft_phyla[ft_phyla.columns[ft_phyla.max() > 1]] #only keep those phyla that have a maximum abundance above 1%
yt = []
plot_names = ['Alteromonadales', 'Bacillales', 'Flavobacteriales', 'Oceanospirillales', 'Rhodobacterales', 'Rhodospirillales', 'Sphingomonadales', 'Vibrionales']
colormap = mpl.cm.get_cmap('hot', 256)
norm = mpl.colors.Normalize(vmin=0, vmax=15)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
for a in range(len(pl)): #for every environment/source combination
lst = list(ft_phyla.loc[pl[a], :]) #get a list of the relative abundance values for each phylum
left = 0
colors = get_cols(len(lst)) #get the colors for plotting (based on number of phyla above 1%)
if len(lst) > 40: #if we have more than 40 phyla, print a warning
print('Warning: there are not enough distinct colors for the stacked bar of the dendro and venn plot')
for b in range(len(lst)): #plot the stacked bar chart for each phyla
ax2.barh(a, lst[b], left=left, height=0.8, edgecolor='k', color=colors[b])
left += lst[b]
yt.append(a)
ax2.barh(a, 100-left, left=left, height=0.8, edgecolor='k', color='k') #make the sample up to 100% (accounting for the phyla removed for being <1%, either above here or in the R script, that weren't agglomerated due to low abundance)
left = 0
x3 = []
for c in range(len(plot_names)): #for each of the families that we are specifically interested in (due to them having been reported in a large number of Plastisphere studies)
num = ft_class.loc[pl[a], plot_names[c]]
ax3.barh(a, 1, left=left, height=1, edgecolor='k', color=m.to_rgba(num)) #now make the heatmap square, and scaled to this abundance
x3.append(left+0.5) #add the y value
left += 1
#make the legend for the stacked bar of phyla, with the correct colors for each phylum
handles = []
phy = list(ft_phyla.columns)
for a in range(len(phy)):
handles.append(mlines.Line2D([], [], color=colors[a], marker='o', markersize=8, markeredgecolor='k', label=phy[a], linestyle=' '))
handles.append(mlines.Line2D([], [], color='k', marker='o', markersize=8, markeredgecolor='k', label='Other', linestyle=' '))
ax2.legend(handles=handles, bbox_to_anchor=(1.03, -0.06), ncol=round(len(lst)/4), fontsize=fs_small)
#change the x and y limits and ticks for these plots
ax2.set_ylim([-0.5, len(pl)-0.5])
ax3.set_ylim([-0.5, len(pl)-0.5])
ax2.set_xlim([0, 100])
ax3.set_xlim([0,8])
plt.sca(ax2)
plt.yticks(yt, [])
plt.sca(ax3)
#add the ticks for each of the families of interest
for a in range(len(plot_names)):
plot_names[a] = r'$'+plot_names[a]+'$'
plt.xticks(x3, plot_names, rotation=90, fontsize=fs_main)
plt.yticks(yt, [])
#now get only the simpsons diversity index for each (kept separately because we don't want to group it by sample as we want to be able to plot the outliers etc in the boxplot)
ft_simps = ft_full.loc[['Simpsons'], :]
for a in range(len(pl)): #for each sample type, get all samples and then make a box plot
this_simp = ft_simps.loc[:, [pl[a]]]
ax4.boxplot(this_simp, positions=[a], vert=False, showfliers=False)
#add x ticks etc and some titles to our plots
plt.sca(ax4)
plt.yticks(yt, [])
ax2.set_xlabel('Relative abundance (%)', fontsize=fs_main)
ax2.set_title('Mean relative abundance of phyla', fontweight='bold', fontsize=fs_title-2)
ax3.set_title('Mean relative\nabundance of\nkey orders', fontweight='bold', fontsize=fs_title-2)
ax4.set_title('Simpsons\nindex\nof diversity', fontweight='bold', fontsize=fs_title-2)
cb1 = mpl.colorbar.ColorbarBase(ax3_colbar, cmap=colormap, norm=norm, orientation='horizontal') #get the colorbar for the heatmap
cb1.set_label('Relative abundance (%)', fontsize=fs_main)
cb1.set_ticks([0, 5, 10, 15])
cb1.ax.tick_params(labelsize=fs_small)
plt.savefig(basedir+'/figures/Fig4_dendro_venn'+ext, dpi=dpi, bbox_inches='tight') #save the figure
plt.close()
return
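The dendrogram step in the function above relies on converting a square distance matrix to condensed form before Ward linkage. A minimal, self-contained sketch of that pattern, with a made-up 3x3 distance matrix between hypothetical sample groups:

```python
import numpy as np
from scipy.spatial import distance as ssd
from scipy.cluster import hierarchy

# Hypothetical symmetric distance matrix between three sample groups
dist = np.array([[0.0, 0.2, 0.8],
                 [0.2, 0.0, 0.7],
                 [0.8, 0.7, 0.0]])
condensed = ssd.squareform(dist)          # condensed vector of length n*(n-1)/2
Z = hierarchy.linkage(condensed, "ward")  # linkage matrix, shape (n-1, 4)
# hierarchy.dendrogram(Z, orientation='left') would then draw it on the current axes
```

Passing the square matrix directly to linkage is a common mistake - scipy would silently treat its rows as observations rather than distances, which is why squareform is needed first.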
Note: basedir should be the directory containing the files listed above. You should also change n_jobs to the number of processors that you want to use (this affects how quickly many functions run), and est to the number of estimators for the random forest sections. An est of 1 will run very quickly but will not be robust - the analyses in the paper used 10,000.
R:
r_unifrac <- function() {
#Import all of the data (these are large files so this may take a while)
asv_table <- read.csv(paste(py$basedir, "grouped_samples_for_unifrac.csv", sep=""), sep=",", row.names=1)
asv_table = as.matrix(asv_table)
phy_tree <- read_tree(paste(py$basedir, "agglom/reduced_tree.tree", sep=""))
#Convert these to phyloseq objects
ASV = otu_table(asv_table, taxa_are_rows = TRUE)
physeq = phyloseq(ASV,phy_tree)
w_uf <- UniFrac(physeq, weighted=TRUE, normalized=FALSE, fast=TRUE)
w_uf_df <- as.data.frame(as.matrix(w_uf))
write.csv(w_uf_df, paste(py$basedir, "/weighted_unifrac_grouped_samples.csv", sep=""))
}
folder_names = ["agglom", "picrust", "figures", "ancom", "figures/ancom", "figures/metacoder", "random_forest", "random_forest/single_environment", "random_forest/log", "random_forest/clr", "random_forest/rel_abun", "figures/random_forest", "figures/random_forest/single_environment", "rarefy_comp"]
for fn in folder_names:
if not os.path.exists(basedir+"/"+fn):
os.makedirs(basedir+"/"+fn) #os.makedirs is more portable than shelling out to mkdir via os.system
if os.path.exists(basedir+'tax_dict.dictionary'):
tax_dict_rare = open_pickle(basedir+'tax_dict.dictionary')
else:
ft, tax_dict_rare = format_R(ft_tax, basedir)
Sort these so that they are all made at once, and check whether they have already been made (save the files and check each time whether the new ones have been generated before calling the R function to do the agglomeration and unifrac, if desired).
### (IV) Perform agglomeration and calculate unifrac distances:
#Import data files
if (!file.exists(paste(py$basedir, "agglom/otu_table.csv", sep=""))) {
if (!file.exists(paste(py$basedir, "agglom/reduced_tree.tree", sep=""))) {
asv_table <- read.csv(paste(py$basedir, "feature_table.csv", sep=""), sep=",", row.names=1)
asv_table = as.matrix(asv_table)
phy_tree <- read_tree(paste(py$basedir, "qiime_output/tree.nwk", sep=""))
#Convert these to phyloseq objects
ASV = otu_table(asv_table, taxa_are_rows = TRUE)
physeq = phyloseq(ASV,phy_tree)
#Remove ASVs below 1% relative abundance (without doing this, we have too many taxa to agglomerate and everything will most likely just crash)
rel_abun <- transform_sample_counts(physeq, function(x) x/sum(x))
remove_low_abun <- filter_taxa(rel_abun, function(x) max(x) > 0.01, TRUE)
#See how many ASVs we had before and after removing the low abundance
physeq
remove_low_abun
#Now do the agglomeration - 0.2 will have the least (~2000 from 7401 for the original data), while 0.05 will keep the most tips (~5000). You can find more information on doing this here: https://www.rdocumentation.org/packages/phyloseq/versions/1.16.2/topics/tip_glom
#0.2 is the default value, but for the analyses shown in the paper we have compromised on 0.1, which gives 4469 taxa remaining (tips)
agglom <- tip_glom(remove_low_abun, h=0.1)
#And look at how many taxa we now have:
agglom
tree = phy_tree(agglom)
write_phyloseq(agglom, type="all", path=paste(py$basedir, "agglom/", sep=""))
ape::write.tree(tree, paste(py$basedir, "agglom/reduced_tree.tree", sep=""))
}
}
if (!file.exists(paste(py$basedir, "agglom/weighted_unifrac.csv", sep=""))) {
#read in files again
asv_table <- read.csv(paste(py$basedir, "agglom/otu_table.csv", sep=""), sep=",", row.names=1)
asv_table = as.matrix(asv_table)
phy_tree <- read_tree(paste(py$basedir, "agglom/reduced_tree.tree", sep=""))
ASV = otu_table(asv_table, taxa_are_rows = TRUE)
physeq = phyloseq(ASV,phy_tree)
#Now we can calculate unifrac distances between samples (I have not normalized here as we already have normalized samples with values account for the removed ASVs that are below a maximum of 1% relative abundance)
w_uf <- UniFrac(physeq, weighted=TRUE, normalized=FALSE, fast=TRUE) #use the physeq object read in above, not agglom, which only exists if the agglomeration block was run in this session
w_uf_df <- as.data.frame(as.matrix(w_uf))
#And finally we can write the data to file so that we can continue our analysis in Python
write.csv(w_uf_df, paste(py$basedir, "agglom/weighted_unifrac.csv", sep=""))
}
if (!file.exists(paste(py$basedir, "agglom/unweighted_unifrac.csv", sep=""))) {
#read in files again
asv_table <- read.csv(paste(py$basedir, "agglom/otu_table.csv", sep=""), sep=",", row.names=1)
asv_table = as.matrix(asv_table)
phy_tree <- read_tree(paste(py$basedir, "agglom/reduced_tree.tree", sep=""))
ASV = otu_table(asv_table, taxa_are_rows = TRUE)
physeq = phyloseq(ASV,phy_tree)
#Now we can calculate unifrac distances between samples (I have not normalized here as we already have normalized samples with values account for the removed ASVs that are below a maximum of 1% relative abundance)
uw_uf <- UniFrac(physeq, weighted=FALSE, normalized=FALSE, fast=TRUE) #as above, use physeq rather than agglom
uw_uf_df <- as.data.frame(as.matrix(uw_uf))
#And finally we can write the data to file so that we can continue our analysis in Python
write.csv(uw_uf_df, paste(py$basedir, "agglom/unweighted_unifrac.csv", sep=""))
}
First reformat the raw file for R (while filtering out low abundance taxa):
if os.path.exists(basedir+'rarefy_comp/feature_table.csv'):
print('True')
ft_raw = pd.read_csv(basedir+'rarefy_comp/feature_table.csv', index_col=0, header=0)
else:
print('False')
basedir_2 = basedir+'rarefy_comp/'
ft_tax_2 = basedir+'rarefy_comp/feature-table_w_tax.txt'
ft_raw, tax_dict_raw = format_R(ft_tax_2, basedir_2)
cutoff = np.median(list(ft_raw.sum()))*0.01
ft_raw = ft_raw[ft_raw.max(axis=1) > cutoff]
ft_raw.to_csv(basedir+'rarefy_comp/feature_table_abun_filt.csv')
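The filter above keeps only ASVs whose maximum count exceeds 1% of the median sample (library) sum. The same pattern on a toy feature table, with hypothetical ASV and sample names:

```python
import numpy as np
import pandas as pd

# Toy feature table: samples as columns, ASVs as rows
ft = pd.DataFrame({'S1': [1000, 5, 0], 'S2': [800, 0, 3]},
                  index=['ASV1', 'ASV2', 'ASV3'])
cutoff = np.median(list(ft.sum())) * 0.01  # 1% of the median library size (here 904 * 0.01)
filtered = ft[ft.max(axis=1) > cutoff]     # keep ASVs whose maximum count exceeds the cutoff
# only ASV1 survives: ASV2 (max 5) and ASV3 (max 3) fall below the 9.04 cutoff
```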
Now do the agglomeration and unifrac (this step takes a long time so I didn’t want to run it unnecessarily - if you want to run it make eval=TRUE):
asv_table <- read.csv(paste(py$basedir, "rarefy_comp/feature_table_abun_filt.csv", sep=""), sep=",", row.names=1)
asv_table = as.matrix(asv_table)
phy_tree <- read_tree(paste(py$basedir, "rarefy_comp/tree.nwk", sep=""))
ASV <- otu_table(asv_table, taxa_are_rows = TRUE)
physeq <- phyloseq(ASV, phy_tree)
physeq_agglom <- tip_glom(physeq, h=0.1)
tree = phy_tree(physeq_agglom)
ape::write.tree(tree, paste(py$basedir, "rarefy_comp/reduced_tree.tree", sep=""))
write_phyloseq(physeq_agglom, type="all", path=paste(py$basedir, "rarefy_comp/", sep=""))
file.rename(paste(py$basedir, "rarefy_comp/otu_table.csv", sep=""), paste(py$basedir, "rarefy_comp/otu_table_raw.csv", sep=""))
#relative abundance transformation
physeq_ra <- transform_sample_counts(physeq_agglom, function(x) x/sum(x))
write_phyloseq(physeq_ra, type="all", path=paste(py$basedir, "rarefy_comp/", sep=""))
file.rename(paste(py$basedir, "rarefy_comp/otu_table.csv", sep=""), paste(py$basedir, "rarefy_comp/otu_table_relabun.csv", sep=""))
#log transformation (with pseudocount)
physeq_log <- transform_sample_counts(physeq_agglom, function (x) log10(x+1))
write_phyloseq(physeq_log, type="all", path=paste(py$basedir, "rarefy_comp/", sep=""))
file.rename(paste(py$basedir, "rarefy_comp/otu_table.csv", sep=""), paste(py$basedir, "rarefy_comp/otu_table_log.csv", sep=""))
#CLR transformation (with pseudocount)
physeq_clr <- microbiome::transform(physeq_agglom, "clr")
write_phyloseq(physeq_clr, type="all", path=paste(py$basedir, "rarefy_comp/", sep=""))
file.rename(paste(py$basedir, "rarefy_comp/otu_table.csv", sep=""), paste(py$basedir, "rarefy_comp/otu_table_clr.csv", sep=""))
# If you want to also calculate unifrac distances (log and CLR)
# w_uf_log <- UniFrac(physeq_log, weighted=TRUE, normalized=FALSE, fast=TRUE)
# w_uf_log_df <- as.data.frame(as.matrix(w_uf_log))
# write.csv(w_uf_log_df, paste(py$basedir, "rarefy_comp/weighted_unifrac_log.csv", sep=""))
# uw_uf_log <- UniFrac(physeq_log, weighted=FALSE, normalized=FALSE, fast=TRUE)
# uw_uf_log_df <- as.data.frame(as.matrix(uw_uf_log))
# write.csv(uw_uf_log_df, paste(py$basedir, "rarefy_comp/unweighted_unifrac_log.csv", sep=""))
# w_uf_clr <- UniFrac(physeq_clr, weighted=TRUE, normalized=FALSE, fast=TRUE)
# w_uf_clr_df <- as.data.frame(as.matrix(w_uf_clr))
# write.csv(w_uf_clr_df, paste(py$basedir, "rarefy_comp/weighted_unifrac_clr.csv", sep=""))
# uw_uf_clr <- UniFrac(physeq_clr, weighted=FALSE, normalized=FALSE, fast=TRUE)
# uw_uf_clr_df <- as.data.frame(as.matrix(uw_uf_clr))
# write.csv(uw_uf_clr_df, paste(py$basedir, "rarefy_comp/unweighted_unifrac_clr.csv", sep=""))
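The CLR transform applied above via microbiome::transform divides each count by the geometric mean of its sample and takes the log (microbiome's implementation handles zeros with its own replacement rule). A pure-Python sketch of the idea, using a simple pseudocount for zeros:

```python
import math

def clr(counts, pseudocount=1):
    """Centred log-ratio transform of one sample's counts (sketch with a pseudocount)."""
    shifted = [c + pseudocount for c in counts]
    log_vals = [math.log(v) for v in shifted]
    mean_log = sum(log_vals) / len(log_vals)  # log of the geometric mean
    return [lv - mean_log for lv in log_vals]

values = clr([0, 9, 99])
# CLR values within one sample always sum to (numerically) zero
```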
Also change these bits, as right now we don't have the correct unifrac files:
### (*VI) Read these files into Python again and open the information that we need:
Changing the parts that are between the '''''' markers will allow you to use a different normalisation method for all further analyses/plots.
ft_df_rare, tree_agglom_rare = open_and_sort(basedir+'/agglom/otu_table.csv'), basedir+'/agglom/reduced_tree.tree'
ft_df_log = open_and_sort(basedir+'/rarefy_comp/otu_table_log.csv')
ft_df_clr = open_and_sort(basedir+'/rarefy_comp/otu_table_clr.csv')
ft_df_rel_abun = open_and_sort(basedir+'/rarefy_comp/otu_table_relabun.csv')
tree_agglom = basedir+'/rarefy_comp/reduced_tree.tree'
tax_dict_other_norm = open_pickle(basedir+'/rarefy_comp/tax_dict.dictionary')
''''''
ft_df = ft_df_clr
tax_dict = tax_dict_other_norm
meta, meta_names, meta_dict = get_meta(meta_fn)
meta_df = get_meta_df(meta, meta_names, list(ft_df.columns))
ft_full = basedir+'/rarefy_comp/feature_table.csv'
w_uf, uw_uf = open_and_sort(basedir+'/agglom/weighted_unifrac.csv'), open_and_sort(basedir+ '/agglom/unweighted_unifrac.csv') #file names for unifrac distances based on agglomerated data
''''''
if os.path.exists(basedir+'/rarefy_comp/ASV_dict.dictionary'):
ASV_dict = open_pickle(basedir+'/rarefy_comp/ASV_dict.dictionary')
else:
ASV_dict = get_ASV_dict(ft_df, seqs, basedir)
write_pickle(basedir+'/rarefy_comp/ASV_dict.dictionary', ASV_dict)
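Throughout the notebook, expensive objects are computed once and then pickled so that re-running just reloads them. The compute-or-load pattern can be sketched with the standard library (compute_or_load here is a stand-in for the open_pickle/write_pickle helpers used above):

```python
import os
import pickle
import tempfile

def compute_or_load(path, compute):
    """Load a pickled object if it exists, otherwise compute and cache it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    obj = compute()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

# First call computes and caches; later calls with the same path just reload
cache = os.path.join(tempfile.mkdtemp(), 'asv_dict.pickle')
result = compute_or_load(cache, lambda: {'ASV1': 'ACGT'})
```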
def filter_seqs(ft, sf):
'''
Function to filter the sequences file to contain only the sequences in the given feature table (used to reduce the PICRUSt analyses to only the important sequences)
Takes as input:
- ft (feature table pandas dataframe with sample names as columns and ASVs as rows)
- sf (name of the sequences fasta file that contains sequences for all of the ASVs)
Returns:
- the names of the new files saved - one of the new sequences fasta file and the other of the feature table saved in the correct format for running PICRUSt2
'''
asvs = list(ft.index.values) #get a list of the asv names
seqs_agglom = [] #create a list to add the sequence records to
for record in SeqIO.parse(sf, "fasta"): #open the sequences file and loop through it
if record.id in asvs: #if the record id matches one found in the asv list
seqs_agglom.append(record) #add the record to the list
SeqIO.write(seqs_agglom, "picrust/sequences_agglom.fasta", "fasta") #save the new list of sequence records
ft.to_csv('picrust/feature_table_agglom.txt', sep='\t')
return 'picrust/sequences_agglom.fasta', 'picrust/feature_table_agglom.txt'
def make_KO_dict(fn, basedir):
'''
Function to make a dictionary with the kegg ortholog information in the file
Takes as input:
- fn (the name of the file that contains the kegg ortholog information - if downloaded with the other data then this should be called kegg_list.csv)
Returns:
- two dictionaries, the first with only the kegg orthologs relating to xenobiotic degradation, pathogenesis or antimicrobial resistance and the second with all kegg orthologs
'''
KO_dict, KO_dict_full = {}, {} #set up the two dictionaries
with open(fn, 'r') as f: #open the .csv file (the 'rU' mode was removed in Python 3.11)
for row in csv.reader(f): #for each row
if row[0] == 'D': #check we are looking at kegg orthologs and not pathway names, which start with a C
if len(row) > 3: #if the row has had something added to the end
if row[3] == 'Xenobiotics' or row[3] == 'Pathogen' or row[3] == 'Antimicrobial resistance': #if the KO is one of the ones we are interested in
KO_dict[row[1]] = row[2:] #add it to our reduced dictionary
KO_dict_full[row[1]] = row[2:] #add it to the full dictionary
write_pickle(basedir+'KO_dict.dictionary', KO_dict) #write our shortened dictionary to file
return KO_dict, KO_dict_full
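The expected structure of kegg_list.csv is rows whose first field is 'D' for an ortholog, with the category (Xenobiotics, Pathogen or Antimicrobial resistance) appended as a fourth field where relevant. A toy example of the same row-filtering logic on an in-memory CSV (the rows here are made up):

```python
import csv
import io

# Hypothetical rows mimicking the kegg_list.csv layout: level, KO id, name, category
raw = io.StringIO(
    "D,K00001,alcohol dehydrogenase,Xenobiotics\n"
    "D,K00002,some other gene\n"
    "C,00010,Glycolysis\n"
)
ko_of_interest, ko_all = {}, {}
for row in csv.reader(raw):
    if row[0] == 'D':  # ortholog rows only; 'C' rows are pathway names
        if len(row) > 3 and row[3] in ('Xenobiotics', 'Pathogen', 'Antimicrobial resistance'):
            ko_of_interest[row[1]] = row[2:]
        ko_all[row[1]] = row[2:]
```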
def filter_picrust(picrust, KO_dict, KO_dict_full, basedir):
    '''
    Function to filter the PICRUSt2 output for only the KEGG orthologs that we are interested in (i.e. those relating to xenobiotics biodegradation, pathogenesis and antimicrobial resistance)
    Takes as input:
    - picrust (file name of the unstratified PICRUSt2 output)
    - KO_dict (dictionary with KEGG orthologs as keys, containing only the orthologs of interest to us)
    - KO_dict_full (dictionary with KEGG orthologs as keys, containing all orthologs)
    - basedir (base directory to write the output files to)
    Returns:
    - picrust (dataframe containing only those KEGG orthologs that we are interested in and that are present in at least one sample)
    - KO_dict (updated KEGG ortholog dictionary, now containing any custom genes that we manually added to PICRUSt2)
    '''
    picrust = pd.read_csv(picrust, sep='\t', header=0, index_col=0) #read in the picrust file
    keeping = [] #set up the list of kegg orthologs to keep
    ko = list(picrust.index.values) #get a list of all current kegg orthologs
    for k in range(len(ko)): #go through and check whether they are in the dictionary of kegg orthologs that we want to keep
        if ko[k] in KO_dict: #if they are
            keeping.append(True) #add 'True' to the list
        elif ko[k] not in KO_dict_full: #otherwise, if they are in neither dictionary
            if ko[k][0] != 'K': #check whether we are looking at one of the genes that we added in with the HMM
                keeping.append(True) #keep it if we are
                KO_dict[ko[k]] = [ko[k], 'Xenobiotics'] #and update the dictionary
            else: #otherwise
                keeping.append(False) #add 'False' to the list
        else: #otherwise
            keeping.append(False) #add 'False' to the list
    picrust = picrust.loc[keeping, :] #only keep the rows with kegg orthologs relating to xenobiotics degradation, pathogenesis or antimicrobial resistance
    write_pickle(basedir+'KO_dict.dictionary', KO_dict) #write the new dictionary to file
    picrust = picrust[picrust.max(axis=1) > 0] #only keep those rows where PICRUSt2 predicts that the gene is present in at least one sample
    write_pickle(basedir+'picrust.dataframe', picrust) #write the new picrust dataframe to file
    return picrust, KO_dict
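The keep/drop logic in filter_picrust (keep orthologs in the reduced dictionary, keep custom HMM genes whose names do not start with 'K', then drop any gene predicted absent from every sample) can be sketched on a toy table. All of the ortholog names, sample names and counts below are illustrative, not taken from the real PICRUSt2 output:

```python
import pandas as pd

#toy unstratified PICRUSt2-style table: rows are KEGG orthologs, columns are samples
picrust = pd.DataFrame(
    {"sample1": [5, 0, 0, 2], "sample2": [3, 0, 1, 0]},
    index=["K00001", "K99999", "K00002", "alkB"],  # 'alkB' stands in for a custom HMM gene
)
KO_dict = {"K00001": ["gene A", "Xenobiotics"]}  #orthologs of interest only
KO_dict_full = {"K00001": ["gene A", "Xenobiotics"],
                "K99999": ["gene B", "Metabolism"],
                "K00002": ["gene C", "Metabolism"]}

keeping = []
for ko in picrust.index:
    if ko in KO_dict:  #an ortholog of interest
        keeping.append(True)
    elif ko not in KO_dict_full and ko[0] != "K":  #a custom gene added via HMM
        keeping.append(True)
        KO_dict[ko] = [ko, "Xenobiotics"]  #update the reduced dictionary
    else:
        keeping.append(False)

picrust = picrust.loc[keeping, :]  #keep only the orthologs of interest
picrust = picrust[picrust.max(axis=1) > 0]  #drop genes absent from all samples
print(sorted(picrust.index))  # → ['K00001', 'alkB']
```

K99999 and K00002 are dropped (known orthologs that are not of interest), while the custom gene alkB is kept and added to the reduced dictionary.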
if not os.path.exists(basedir+'picrust/picrust_out/ko_all_metagenome_out/pred_metagenome_unstrat.tsv.gz'):
    seqs_agglom_fn, ft_agglom_fn = filter_seqs(ft_df, seqs)
    #each os.system call runs in its own shell, so the environment activation and the PICRUSt2 pipeline must run in the same call
    os.system("source activate picrust2 && picrust2_pipeline.py -s "+seqs_agglom_fn+" -i "+ft_agglom_fn+" -o picrust/picrust_out --custom_trait_tables ko_all.txt --stratified --no_pathways")
if not os.path.exists(basedir+'KO_dict.dictionary'):
    ko_file = basedir+'picrust/kegg_list.csv'
    KO_dict, KO_dict_full = make_KO_dict(ko_file, basedir)
else:
    KO_dict = open_pickle(basedir+'KO_dict.dictionary')
if not os.path.exists(basedir+'picrust.dataframe'):
    if 'KO_dict_full' not in globals(): #KO_dict_full is only created by make_KO_dict, so rebuild both dictionaries if we loaded KO_dict from file
        KO_dict, KO_dict_full = make_KO_dict(basedir+'picrust/kegg_list.csv', basedir)
    picrust_file = 'picrust/picrust_out/ko_all_metagenome_out/pred_metagenome_unstrat.tsv.gz'
    picrust, KO_dict = filter_picrust(basedir+picrust_file, KO_dict, KO_dict_full, basedir) #filter_picrust reads the file itself, so we don't need to read it in first
else:
    picrust = open_pickle(basedir+'picrust.dataframe')
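open_pickle and write_pickle are defined elsewhere in the notebook; a minimal sketch of the caching pattern they support (compute once, pickle to disk, reload on later runs), assuming they are thin wrappers around Python's pickle module — the file name here is illustrative:

```python
import os
import pickle

def write_pickle(fn, obj):
    #serialize obj to fn (assumed behaviour of the notebook's helper)
    with open(fn, 'wb') as f:
        pickle.dump(obj, f)

def open_pickle(fn):
    #load a previously pickled object (assumed behaviour of the notebook's helper)
    with open(fn, 'rb') as f:
        return pickle.load(f)

#the caching pattern used above: only recompute if the cached file is missing
fn = 'KO_dict_demo.dictionary'  #illustrative file name
if not os.path.exists(fn):
    KO_dict = {'K00001': ['gene A', 'Xenobiotics']}  #stand-in for make_KO_dict()
    write_pickle(fn, KO_dict)
else:
    KO_dict = open_pickle(fn)
```

On the first run the dictionary is built and written; on every later run it is loaded from disk, which is why deleting the .dictionary and .dataframe files forces the notebook to recompute them.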
Probably remove this chunk, but first check that everything will run without it.
if not os.path.exists(basedir+'/figures/Fig1_map_metrics'+ext):
study_map_metrics(study_dates, study_locs, basedir, map_img, meta_df)
This should save Figure 1 as ‘figures/Fig1_map_metrics.png’ (if you have changed the extension then this will be different):
if not os.path.exists(basedir+'/figures/Fig2_nmds_overall'+ext):
w_uf.to_csv(basedir+'agglom/sorted_weighted_unifrac.csv')
uw_uf.to_csv(basedir+'agglom/sorted_unweighted_unifrac.csv')
nmds_plot_study_env('agglom/sorted_weighted_unifrac.csv', 'agglom/sorted_unweighted_unifrac.csv', meta_dict, basedir)
This should save Figure 2 as ‘figures/Fig2_nmds_overall.png’ (if you have changed the extension then this will be different):

### (IX) Get the heatmap using weighted and unweighted unifrac distances that shows the average distance within and between studies (Figure 3)
if not os.path.exists(basedir+'/figures/Fig3_unifrac_heatmap_combined'+ext):
similarity_heatmap_combined(w_uf, uw_uf, basedir)
This should save Figure 3 as ‘figures/Fig3_unifrac_heatmap_combined.png’ (if you have changed the extension then this will be different):
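similarity_heatmap_combined is defined elsewhere in the notebook; the core computation behind it (averaging UniFrac distances within and between studies to fill a study-by-study matrix) can be sketched on a toy distance matrix. All sample names, study names and distances below are illustrative:

```python
import numpy as np
import pandas as pd

#toy square distance matrix (e.g. weighted UniFrac), samples labelled by study
samples = ['A1', 'A2', 'B1', 'B2']
study = {'A1': 'StudyA', 'A2': 'StudyA', 'B1': 'StudyB', 'B2': 'StudyB'}
dist = pd.DataFrame(
    [[0.0, 0.1, 0.6, 0.7],
     [0.1, 0.0, 0.5, 0.8],
     [0.6, 0.5, 0.0, 0.2],
     [0.7, 0.8, 0.2, 0.0]],
    index=samples, columns=samples,
)

studies = sorted(set(study.values()))
heat = pd.DataFrame(index=studies, columns=studies, dtype=float)
for s1 in studies:
    for s2 in studies:
        rows = [s for s in samples if study[s] == s1]
        cols = [s for s in samples if study[s] == s2]
        block = dist.loc[rows, cols].to_numpy()
        if s1 == s2:
            #within-study: average the off-diagonal pairs only (a sample is distance 0 from itself)
            vals = block[~np.eye(len(rows), dtype=bool)]
        else:
            #between-study: average every pair across the two studies
            vals = block.ravel()
        heat.loc[s1, s2] = vals.mean()
print(heat)
```

With these toy values the within-study averages (0.1 and 0.2) are much smaller than the between-study average (0.65), which is the pattern the Figure 3 heatmap is designed to reveal.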